Wednesday, August 13, 2025

AI agents far from ready for autonomous office work

A study by Carnegie Mellon and the Allen Institute for AI shows that, in a simulated workplace, agents based on generative AI still fail at the majority of professional tasks.

Researchers from Carnegie Mellon and the Allen Institute for AI have evaluated the real capabilities of generative AI agents in a simulated professional environment. Their study, published on the arXiv preprint platform, reveals that even the best-performing models fail at the majority of tasks, calling into question their current ability to automate complex office functions.

To carry out this evaluation, the authors designed an open source platform called TheAgentCompany, which simulates a software company. The environment includes realistic tools such as GitLab, Rocket.Chat and ownCloud, hosted locally. This approach aims to test the agents under conditions close to those encountered in real professional contexts.

Less than 30% autonomous success

Several AI agents, based on LLMs such as Gemini 2.5 Pro, GPT-4o and Claude 3.7 Sonnet, were tasked with collaborating to complete 175 typical tasks, covering areas such as software development, project management, human resources and finance.

The results are clear: no agent completed more than 30% of the tasks on its own. When partially completed tasks are counted, Gemini 2.5 Pro comes out as the best-performing model, with a success rate of 30%, according to the report.

By way of comparison, GPT-4o succeeded in only 8% of the tasks, while Claude 3.7 Sonnet reached a weighted success rate of 26%. Among open source models, Llama 3.1 (405B) tops out at 7% success.

Marked limits on complex tasks

The performance of the agents varies widely depending on the nature of the tasks. They are relatively effective at software development, while results drop sharply for tasks in finance, human resources or administrative functions. Although these tasks are considered less complex for humans, they pose more difficulties for AI agents, in particular because little training data is available for these specific professional contexts.

The researchers identify several recurring causes of failure: agents struggle to manage time, set priorities and maintain a coherent line of work across multiple steps. They also have difficulty interacting with complex interfaces and correctly interpreting social exchanges. Some agents, when uncertain, generate responses that look plausible but do not fit the context, or try to “short-circuit” the steps of a task by bypassing the instructions, which undermines their reliability in demanding professional environments.

Lowered expectations

These results echo Gartner’s cautious forecast from last June, according to which more than 40% of projects involving AI agents will be abandoned by 2027, in particular because of uncertain business value and underestimated complexity. The firm also warns of the phenomenon of “agent washing”, which consists of presenting conventional tools (assistants, RPA) as genuine AI agents, while few solutions are capable of operating independently on complex tasks.

The authors of the study believe that progress in language models is not enough on its own. According to them, these models must now be combined with better long-term reasoning capabilities, more robust interfaces and more effective collaboration mechanisms in order to reach a new milestone in automation.

According to the study, while generative AI can support certain processes, it remains far from being able to reliably automate complex professional functions. The TheAgentCompany benchmark is available online so that other researchers and companies can test their own agents.

lennon.ross