
WorkArena++: The 93% Gap Between Human and AI Agent Performance on Compositional Enterprise Tasks

· 5 min read
Mike Thrift
Marketing Manager

WorkArena++ (arXiv:2407.05291, NeurIPS 2024) extends the original WorkArena benchmark to 682 compositional enterprise tasks that require chaining multiple workflows — exactly the multi-step knowledge work that a Beancount automation agent would need to handle. I'm reading it now because the original WorkArena log (LOG-061) left open the question of what happens when you compose atomic tasks into real workflows. The answer, as this paper makes clear, is that every current LLM falls off a cliff.

The paper


Boisvert et al. at ServiceNow Research take the atomic task components from the original WorkArena — form filling, list filtering, knowledge base search, dashboard reading — and compose them into realistic multi-step enterprise workflows. The benchmark runs entirely inside a live ServiceNow instance via the BrowserGym environment, giving agents HTML observations and optional screenshot inputs.

The key structural decision is a three-level difficulty hierarchy. L1 is the original WorkArena: atomic, single-action tasks like "filter this list by status = Closed." L2 introduces compositional tasks with explicit step-by-step instructions — the agent receives a full procedure in the chat but must execute a chain of sub-tasks across different ServiceNow modules without losing track. L3 is the hard version: the agent gets only an implicit goal ("onboard a new employee") and must first retrieve the relevant procedure from the company's knowledge base before planning and executing the steps. That is exactly how real knowledge workers operate.
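The three levels can be sketched as a task specification. This is a hypothetical illustration of the hierarchy's shape, not the benchmark's actual task schema; all field names are mine:

```python
from dataclasses import dataclass

# Illustrative sketch of the WorkArena++ difficulty hierarchy.
# Field names are hypothetical, not the benchmark's real API.
@dataclass
class TaskSpec:
    level: int                 # 1, 2, or 3
    goal: str                  # what the agent is told in the chat
    procedure: list[str]       # explicit steps (given only at L2)
    requires_kb_lookup: bool   # L3: the procedure must be retrieved first

l1 = TaskSpec(1, "Filter the incident list by status = Closed", [], False)
l2 = TaskSpec(2, "Onboard a new employee",
              ["Create the user record", "Order hardware", "Grant access"],
              False)
l3 = TaskSpec(3, "Onboard a new employee", [], True)  # procedure lives in the KB
```

The L2/L3 pair shares a goal; only the presence of the explicit procedure differs, which is what makes the performance cliff between them diagnostic.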

The authors also include a mechanism to automatically generate ground-truth observation-action traces from oracle solutions, enabling supervised fine-tuning without manual annotation.
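The idea can be sketched as replaying a known-correct action sequence and logging what the agent would have observed before each step. A minimal sketch with a toy environment, not the paper's implementation or BrowserGym's actual interface:

```python
# Minimal sketch of oracle-trace collection: replay a known-correct
# action sequence and record (observation, action) supervised pairs.
# The env/oracle interfaces here are toy stand-ins, not BrowserGym's API.

def collect_trace(env, oracle_actions):
    """Pair each pre-action observation with the oracle's next action."""
    trace = []
    obs = env.reset()
    for action in oracle_actions:
        trace.append((obs, action))  # supervised pair: state -> correct action
        obs = env.step(action)
    return trace

class ToyEnv:
    """Deterministic stub whose observation is just a step counter."""
    def reset(self):
        self.t = 0
        return f"obs_{self.t}"
    def step(self, action):
        self.t += 1
        return f"obs_{self.t}"

trace = collect_trace(ToyEnv(), ["click_new", "fill_form", "submit"])
```

Because the oracle is deterministic and the instance is live, no human ever labels an observation; the correctness of the trace follows from the correctness of the oracle solution.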

Key ideas

  • Humans solve 93.9% of composite tasks; GPT-4o solves 2.1%. This is not a language understanding failure — it is a planning and execution failure at scale.
  • No model completes any L3 task. The requirement to retrieve a procedure, plan steps, and execute without explicit guidance is completely unsolved by all tested models, including GPT-4o-v (the vision-capable variant).
  • Only GPT-4o and GPT-4o-v succeed at a subset of L2 tasks, primarily memorization sub-tasks. Llama3-based agents largely fail at both L2 and L3.
  • L3 task realism is the key design choice: receiving an implicit goal like "onboard a new employee" without a procedure — and then having to look it up — is how employees actually receive assignments in enterprise settings.
  • Five capability dimensions are tested: planning under constraints, information retrieval, data-driven reasoning, sequential memory, and recognizing infeasible tasks.
  • Documented failure modes: hallucinations about UI elements, inability to maintain multi-step plans across a long context, and failure to cross-reference information from separate documents.

What holds up — and what doesn't

The 93.9% vs. 2.1% headline is striking but mechanistically explicable. L2 and L3 require a model to remember what it did three steps ago, correlate information retrieved from one document with a form it is about to fill, and know when a sub-step depends on completing a prior one. These are not exotic — humans do them effortlessly — but current LLM agents break on the coordination.

What I find most valuable here is the L2-versus-L3 design. L2 hands the agent a procedure; L3 does not. The performance cliff between them isolates exactly one capability: substituting retrieval-plus-planning for explicit instruction-following. That is the hard part of autonomous knowledge work, and the benchmark cleanly exposes it.

What the paper does not do is show that the training trace mechanism actually helps. The authors provide the infrastructure to generate fine-tuning data and state that models can be trained on it — but they do not report results from doing so. Without that experiment, WorkArena++ is a benchmark on which all current agents fail, with no demonstrated path to improvement. That limits its near-term utility as a training target.

The reliance on ServiceNow also constrains generalizability. ServiceNow has an unusually structured, well-documented interface. If agents fail here, they will fail worse on the messier enterprise systems that most organizations actually run.

Why this matters for finance AI

The connection to Beancount automation is direct. An autonomous accounting agent is doing L3-style work by default: a user says "reconcile last month's expenses," and the agent must retrieve the relevant account structure from the ledger, plan which entries to inspect, cross-reference against imported bank data, and execute write-back operations — all without a step-by-step guide. WorkArena++ puts a number on how badly current agents handle this pattern.
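That retrieve-plan-cross-reference-execute loop can be sketched in miniature. Everything below is a hypothetical placeholder, not an existing Beancount or agent API, and the data model is deliberately simplified to dicts:

```python
# Hypothetical L3-style pipeline for a ledger agent: implicit goal in,
# context retrieved, plan formed, results cross-referenced. All names
# and the ledger representation are illustrative, not Beancount's model.

def reconcile(goal, ledger, bank_rows):
    # 1. Retrieval: find the accounts relevant to the implicit goal.
    accounts = [a for a in ledger["accounts"] if a.startswith("Expenses:")]
    # 2. Planning: decide which ledger entries to inspect.
    plan = [e for e in ledger["entries"] if e["account"] in accounts]
    # 3. Cross-reference: match planned entries against imported bank data.
    matched = [e for e in plan
               if any(abs(e["amount"] - r["amount"]) < 0.01 for r in bank_rows)]
    # 4. Execution: write-back would happen here; this sketch just reports.
    return {"inspected": len(plan), "matched": len(matched)}

ledger = {"accounts": ["Expenses:Food", "Assets:Bank"],
          "entries": [{"account": "Expenses:Food", "amount": 12.50},
                      {"account": "Assets:Bank", "amount": -12.50}]}
bank = [{"amount": 12.50}]
result = reconcile("reconcile last month's expenses", ledger, bank)
```

Note that no step-by-step procedure appears anywhere: step 1 stands in for the knowledge-base lookup that defines L3, which is exactly where every tested model scored zero.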

The training trace mechanism is also immediately applicable. Beancount tasks have deterministic oracle solutions — the correct journal entries are verifiable — which means ground-truth traces could be generated at scale for fine-tuning a specialized ledger agent. That is precisely the mechanism WorkArena++ provides but leaves unexploited in the paper itself: a design blueprint more than a solved problem.
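The verifiability claim is the load-bearing part: if oracle entries exist, any predicted entry set can be checked automatically, and every checked episode is a candidate training example. A toy sketch of such a checker, using an illustrative posting representation rather than Beancount's real data model:

```python
# Toy sketch: deterministic oracle entries make agent output machine-
# checkable, which is what permits generating training traces at scale.
# The (account, amount) posting representation is illustrative only.

def entries_match(predicted, oracle):
    """Order-insensitive comparison of (account, amount) postings."""
    def normalize(entries):
        return sorted((e["account"], round(e["amount"], 2)) for e in entries)
    return normalize(predicted) == normalize(oracle)

oracle = [{"account": "Expenses:Food", "amount": 12.50},
          {"account": "Assets:Bank", "amount": -12.50}]
predicted = [{"account": "Assets:Bank", "amount": -12.50},
             {"account": "Expenses:Food", "amount": 12.50}]
ok = entries_match(predicted, oracle)  # correct up to posting order
```

A real checker would also compare dates, payees, currencies, and metadata, but the principle is the same: a deterministic oracle turns every agent attempt into labeled data for free.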

The zero L3 success rate is the most useful calibration point for Bean Labs: even in a controlled enterprise environment with clean data and a well-structured interface, state-of-the-art agents cannot yet handle implicit-goal compositional tasks. That gap is where the interesting research lives.

  • TheAgentCompany (arXiv:2412.14161) — 175 tasks inside a simulated software company with real internal tooling (GitLab, RocketChat); best agent completes ~30%; a more naturalistic enterprise setting than ServiceNow
  • τ²-bench (arXiv:2506.07982) — extends τ-bench to dual-control environments where both agent and user can modify shared state simultaneously; directly relevant to Beancount sessions where users and agents co-edit a ledger
  • CRMArena-Pro (arXiv:2505.18878) — holistic LLM agent assessment across CRM business scenarios using newer models; tests whether the WorkArena++ capability gap has narrowed