TheAgentCompany: Benchmarking LLM Agents on Real-World Enterprise Tasks
TheAgentCompany is the most realistic enterprise agent benchmark I've read so far in this series. It comes from Graham Neubig's group at CMU and was submitted to NeurIPS 2024, motivated by a clear gap: existing benchmarks test isolated web navigation or GitHub issue resolution, but real workplace tasks require agents to browse internal platforms, message colleagues, write code, and run programs within a single task. I'm reading it now because it's the closest controlled experiment we have on whether LLM agents can actually function as digital coworkers in a consequential setting.
The paper
Xu et al. construct a self-contained simulated company: a local workspace plus an intranet running real instances of GitLab, OwnCloud, Plane (project management), and RocketChat (team messaging). The environment also includes simulated colleagues—NPCs backed by LLMs—so agents can send messages and receive guidance mid-task. Tasks span seven role categories: software development engineering (SDE), project management, HR, data science, finance, admin, and a catch-all "other." The total is 175 tasks, curated by 20 computer science students and software engineers over approximately 3,000 person-hours across two months.
Evaluation uses a checkpoint system: each task has intermediate milestones worth a fraction of the total score, plus a bonus for full completion. Assessors are either deterministic (checking file contents, code outputs, environment state) or LLM-based (evaluating free-form text). All models run under the OpenHands agent framework, which provides code execution, web browsing, and terminal access from a single configurable harness.
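The checkpoint scheme is easy to make concrete. Here is a minimal sketch of partial-credit scoring under one simplifying assumption: I weight the checkpoint fraction and the full-completion bonus 50/50, which is my illustrative choice, not a formula quoted from the paper.

```python
def score_task(points_earned: int, points_total: int, fully_completed: bool) -> float:
    """Partial-credit score: checkpoint fraction plus a full-completion bonus.

    The 50/50 weighting is an illustrative assumption, not the paper's
    exact formula; the point is that progress short of full completion
    still registers, instead of collapsing to binary pass/fail.
    """
    checkpoint_fraction = points_earned / points_total
    bonus = 1.0 if fully_completed else 0.0
    return 0.5 * checkpoint_fraction + 0.5 * bonus

# A task where the agent hits 3 of 4 checkpoint points but never finishes:
print(score_task(3, 4, False))  # 0.375
# A fully completed task scores 1.0:
print(score_task(4, 4, True))   # 1.0
```

Under binary scoring both runs above would be indistinguishable failures and successes; the partial score is what lets the paper report the 39.3% vs 30.3% gap for the same model.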
Key ideas
- Gemini-2.5-Pro leads at 30.3% full completion and 39.3% partial score; Claude-3.7-Sonnet follows at 26.3% / 36.4%; GPT-4o reaches only 8.6% / 16.7%; Llama-3.1-405B manages 7.4%.
- The best model averages roughly 27 agent steps and costs over $4 per task—even for tasks the authors describe as simpler than real workplace complexity.
- Finance is among the hardest categories, alongside admin and data science; SDE tasks are reliably the easiest despite requiring more specialized technical knowledge.
- Three failure modes dominate: navigating complex web UIs (especially OwnCloud's office suite), failing to productively use colleague messages ("lack of social skills"), and abandoning multi-document admin tasks that require tedious cross-referencing.
- The authors attribute the SDE advantage directly to training data bias: LLM pretraining skews heavily toward code and GitHub data due to prominent benchmarks and abundant public training signal, so models generalize far better to software tasks than to HR or finance workflows.
What holds up — and what doesn't
The environment design is genuinely impressive. Running real GitLab, OwnCloud, and RocketChat rather than simulated stubs means agents face authentic UI complexity—real popups, auth flows, and edge cases. The checkpoint-based partial scoring is also the right call: binary success/fail would make most tasks look uniformly hopeless, obscuring where agents actually make progress.
That said, several weaknesses are worth flagging. Most critically, there is no human performance baseline. The authors acknowledge this—resource constraints precluded collecting human timings or success rates—which means we have no denominator. 30% agent completion sounds bad, but without knowing whether a human would spend 20 minutes or 3 hours on the same task, or whether some tasks are genuinely ambiguous, the number is hard to contextualize.
The finance category has only 12 tasks. That's too small to draw robust conclusions about finance-specific failures. Are agents worse at finance because of some property of financial reasoning, or because the finance tasks happen to involve more OwnCloud document navigation? The paper can't disambiguate at this scale, and the authors don't try to.
The authors also acknowledge that tasks "are generally on the more straightforward side due to the need to automatically evaluate with programs and test cases." The hardest real accounting or finance tasks—preparing a year-end reconciliation from inconsistent source data, flagging regulatory compliance issues, producing a management report across multiple ledger periods—are essentially impossible to autoevaluate. The benchmark likely undersamples exactly the tasks that would matter most for autonomous finance agents.
Why this matters for finance AI
The results here are sobering in a useful way. A 30% completion rate on tasks the authors call simplified means autonomous agents are nowhere near operational for real accounting workflows. The finance category is specifically weak, and the dominant failure modes—complex UIs, multi-document retrieval, communication breakdown with human counterparts—are precisely the skills a Beancount automation agent would need: pulling data from document storage, cross-referencing transactions across reports, and asking clarifying questions before committing writes.
The $4-per-task cost for the best model is a forcing function. At that rate, running an agent on a routine month-end close involving dozens of sub-tasks would cost hundreds of dollars with no reliability guarantee. Gemini-2.0-Flash's pattern of cutting losses early—achieving 19.0% partial score at under $1 per task—suggests there is real engineering value in knowing when to stop and escalate rather than burning tokens on a failing trajectory.
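A back-of-envelope calculation makes the forcing function explicit. This sketch uses the figures reported above ($4 and 30.3% for the best model; $1 and 19.0% for Gemini-2.0-Flash) and assumes independent retry-until-success, which is my simplification; a real close would escalate to a human rather than retry indefinitely, and the Flash number is a partial score used as a rough proxy for success.

```python
def cost_per_success(cost_per_attempt: float, success_rate: float) -> float:
    """Expected cost per solved task under independent retries (1/p attempts)."""
    return cost_per_attempt / success_rate

# Gemini-2.5-Pro: $4/attempt at 30.3% full completion
best = cost_per_success(4.0, 0.303)
# Gemini-2.0-Flash: $1/attempt, using its 19.0% partial score as a crude proxy
cheap = cost_per_success(1.0, 0.190)

print(f"best model: ${best:.2f} per solved task")   # ~$13.20
print(f"cheap model: ${cheap:.2f} per solved task")  # ~$5.26

# A hypothetical month-end close with 50 sub-tasks:
print(f"close with best model: ${50 * best:.0f}")
```

Even under this generous retry model, the cheaper model that quits early comes out ahead per solved task, which is the engineering argument for stop-and-escalate policies.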
The simulated-colleague NPCs are an interesting design primitive that maps directly onto a real constraint for Beancount automation: agents that ignore user feedback and proceed with wrong assumptions are more dangerous than agents that halt and ask. The benchmark's finding that current models fail to extract useful information from colleague messages should be a direct design input for any write-back agent that interacts with a human accountant mid-session.
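The halt-and-ask pattern can be sketched as a simple confirmation gate. Everything here is hypothetical illustration, not Beancount's API: the `Entry` type, the confidence field, and the threshold are all my assumptions about what such a gate might look like.

```python
from dataclasses import dataclass

@dataclass
class Entry:
    """Hypothetical proposed ledger write; not a real Beancount type."""
    account: str
    amount: float
    confidence: float  # agent's self-assessed confidence in the account mapping

def commit_or_ask(entry: Entry, threshold: float = 0.9) -> str:
    """Halt-and-ask policy: low-confidence writes become questions, not commits."""
    if entry.confidence >= threshold:
        return f"COMMIT {entry.account} {entry.amount:.2f}"
    # Escalate instead of guessing -- the failure mode TheAgentCompany flags
    # is agents that plow ahead on wrong assumptions despite available feedback.
    return f"ASK: is '{entry.account}' the right account for {entry.amount:.2f}?"

print(commit_or_ask(Entry("Expenses:Travel", 412.50, 0.97)))
print(commit_or_ask(Entry("Expenses:Misc", 412.50, 0.55)))
```

The interesting design question the benchmark raises is the second half of the loop: once the agent asks, can it actually incorporate the human's answer? TheAgentCompany's results suggest that is where current models break down.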
What to read next
- OpenHands: An Open Platform for AI Software Developers as Generalist Agents — the agent framework underlying TheAgentCompany; arXiv:2407.16741, ICLR 2025. Understanding OpenHands' CodeAct + browsing architecture clarifies which agent capabilities are baseline versus what TheAgentCompany is actually testing.
- DocFinQA: A Long-Context Financial Reasoning Dataset — extends 7,437 FinQA questions to full SEC filings averaging 123k words; arXiv:2401.06915, ACL 2024. Directly tests the long-document finance reasoning that TheAgentCompany's 12 finance tasks cannot adequately sample.
- Evaluation and Benchmarking of LLM Agents: A Survey — arXiv:2507.21504. A 2025 survey of the agent evaluation landscape that puts TheAgentCompany in context alongside WebArena, OSWorld, and SWE-bench and traces how benchmark design choices shape what we can conclude about agent capability.
