WorkArena: How LLM Web Agents Perform on Real Enterprise Knowledge Work
After reading τ-bench's evaluation of tool-calling agents in retail and airline domains, I wanted to push into enterprise software — the territory where Beancount-style agents actually need to operate. WorkArena (Drouin et al., ServiceNow Research, 2024) benchmarks LLM web agents on 33 real tasks inside the ServiceNow enterprise platform, making it the most direct existing test of whether current models can automate genuine knowledge-worker workflows rather than synthetic toy scenarios.
The paper
"WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?" introduces a benchmark of 33 tasks and 19,912 unique instances drawn from the ServiceNow enterprise software platform. The tasks cover six categories that knowledge workers actually perform daily: filtering and sorting lists, filling forms, searching knowledge bases, ordering from service catalogs, reading dashboards, and navigating menus. Alongside the benchmark, the authors release BrowserGym, an evaluation harness that gives agents rich multimodal observations — HTML, accessibility trees, screenshots — plus a standardized action space for web interactions.
The core question the paper asks is whether current LLMs can handle the structured, multi-step, UI-constrained workflows that real enterprise software demands. These are not open-ended search tasks or single-turn QA; they are goal-directed sequences of clicks, form entries, and filter operations that leave verifiable traces in a live system. That verification-from-system-state property is what makes WorkArena meaningfully different from most agentic benchmarks, and it is exactly the property a Beancount write-back agent would need to satisfy.
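The distinction is worth making concrete. A minimal sketch, with all names invented for illustration: an output-matching check passes if the agent's final answer looks right, while a state-based check re-reads the record from the (here mocked) system and compares fields.

```python
# Illustrative sketch: output-matching vs. state-based verification.
# The "system" here is a plain dict standing in for a live platform record;
# all identifiers are invented for illustration.

def verify_by_output(agent_answer: str, expected: str) -> bool:
    """String matching: passes if the agent *says* the right thing."""
    return agent_answer.strip() == expected.strip()

def verify_by_state(system: dict, record_id: str, expected_fields: dict) -> bool:
    """State-based: passes only if the system record actually changed."""
    record = system.get(record_id)
    if record is None:
        return False
    return all(record.get(k) == v for k, v in expected_fields.items())

# An agent can claim success while the system is unchanged:
system = {"INC001": {"priority": "3 - Moderate", "assignee": None}}
claim = "I set INC001 priority to 1 - Critical."

assert verify_by_output(claim, claim)                     # the claim matches itself
assert not verify_by_state(system, "INC001",
                           {"priority": "1 - Critical"})  # but the record is untouched
```

The second check is the one WorkArena runs, which is why its failure numbers are hard to argue with.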
Key ideas
- GPT-4o reaches 42.7% overall on WorkArena with chain-of-thought prompting; GPT-3.5-Turbo manages only 6.1%, and the open-source Llama3-70B-Instruct sits at 17.9% — a 25-point gap between frontier proprietary and frontier open-source models.
- List-filter tasks are a complete wall: 0% for every model. The ServiceNow list widget uses non-standard HTML that none of the tested agents could reliably interact with. Sorting is nearly as bad: GPT-4o achieves only 10% on list-sort tasks.
- Service catalog tasks are surprisingly tractable: GPT-4o reaches 77.8% on the nine service-catalog tasks, where the UI is more conventional and the required actions map closely to form-filling patterns the model has likely seen in training.
- Multimodal observations barely help. Adding screenshots to GPT-4o's observations produced "very minor performance improvements," suggesting that the bottleneck is UI structure understanding, not the absence of visual input.
- Chain-of-thought is load-bearing. Removing it drops Llama3-70B by around 10 points on WorkArena, confirming that multi-step web tasks require explicit intermediate reasoning, not just action prediction.
- Memory mechanisms backfired. Enabling a `use_think_history` flag caused agents to "stick to decisions decided in early time steps, even erroneous ones" — a concrete example of rigid commitment masquerading as planning.
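The chain-of-thought dependence is easy to picture at the prompt level. A minimal sketch, not the paper's actual templates: the agent sees the goal plus a (truncated) accessibility tree, and the CoT variant simply asks for explicit reasoning before the action.

```python
# Minimal sketch of a chain-of-thought web-agent prompt, in the spirit of
# BrowserGym-style agents. Prompt wording, action names, and bid attributes
# are illustrative, not the paper's actual templates.

def build_cot_prompt(goal: str, axtree: str, use_cot: bool = True) -> list[dict]:
    """Assemble chat messages for one agent step."""
    system = ("You are a web agent. Respond with one action such as "
              "click(bid) or fill(bid, text).")
    user = f"Goal: {goal}\n\nAccessibility tree:\n{axtree}\n"
    if use_cot:
        # The load-bearing part: force intermediate reasoning before the action.
        user += "\nThink step by step about the page state, then give one action."
    else:
        user += "\nGive one action."
    return [{"role": "system", "content": system},
            {"role": "user", "content": user}]

msgs = build_cot_prompt(
    goal="Filter the incident list to priority = 1 - Critical",
    axtree='button "Filter" [bid=a12]\ncombobox "Priority" [bid=a13]',
)
assert "Think step by step" in msgs[1]["content"]
```

The ~10-point swing from a one-line prompt difference is a reminder of how fragile these evaluations are to harness choices.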
What holds up — and what doesn't
The benchmark's most valuable property is that it runs against a live ServiceNow instance: success is determined by whether the system's state actually changed correctly, not by string matching against an expected output. That makes the 0% on list-filter tasks particularly damning — there is nowhere to hide. The task variety is also genuinely representative: the six categories span the breadth of what knowledge workers spend time on, not cherry-picked showcase tasks.
What I find less satisfying is the treatment of failure modes. The paper identifies that exotic HTML structures, nested iFrames, and shadow DOMs break agents, but it does not systematically ablate which structural features are responsible or in what proportion. The DOM-size problem — HTML trees ranging from 40k to 500k tokens — is mentioned but not deeply analyzed: we don't know whether summarization, chunking, or accessibility-tree-only observations would recover performance. The single-agent architecture is also never compared against a decomposed multi-agent setup (a selector/executor split, for instance), so it is unclear whether the 0% list-filter result is an interface problem, a planning problem, or both.
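One of the untested mitigations is easy to sketch: prune a huge DOM down to its interactive elements before handing it to the model. A stdlib-only sketch under my own assumptions about what to keep; real agents would more likely work from the browser's accessibility tree.

```python
# Sketch of DOM pruning as a token-budget mitigation: keep only interactive
# elements and their identifying attributes, drop layout wrappers. The tag
# and attribute whitelists are my own illustrative choices.
from html.parser import HTMLParser

INTERACTIVE = {"a", "button", "input", "select", "option", "textarea", "label"}
KEPT_ATTRS = ("id", "name", "type", "value", "aria-label")

class InteractivePruner(HTMLParser):
    """Collect a flat list of interactive elements with their key attributes."""
    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        if tag in INTERACTIVE:
            kept = {k: v for k, v in attrs if k in KEPT_ATTRS}
            self.elements.append((tag, kept))

def prune_dom(html: str) -> str:
    parser = InteractivePruner()
    parser.feed(html)
    return "\n".join(f"{tag} {attrs}" for tag, attrs in parser.elements)

page = """
<div class="wrapper"><div class="deeply"><div class="nested">
  <input id="filter" type="text" aria-label="Filter list">
  <button id="apply">Apply</button>
</div></div></div>
"""
pruned = prune_dom(page)
assert "filter" in pruned and "apply" in pruned
assert len(pruned) < len(page)  # the wrapper divs are gone
```

Whether this kind of reduction recovers any of the list-task performance is exactly the ablation the paper doesn't run.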
There is also a validity-of-the-platform question worth raising. ServiceNow is a specific enterprise software stack with idiosyncratic UI patterns. The results tell us a lot about ServiceNow agents and somewhat less about enterprise web agents in general. Generalizing the list-filter failure to, say, a beanquery interface or a spreadsheet tool requires independent evidence.
Why this matters for finance AI
The WorkArena results are a calibration point I keep coming back to for the Beancount automation agenda. The failure pattern is instructive: agents do well on tasks that look like web forms (service catalog, 77.8%) and collapse on tasks that require precise interaction with structured, non-standard UI widgets (list filtering, 0%). A Beancount agent doing ledger entry would face a mixed picture: the natural-language-to-transaction part resembles the form-filling tasks where performance is reasonable; but the query, filter, and reconciliation parts — finding specific entries, sorting by date, applying account filters — look far more like the list tasks where everything breaks.
The paper also reinforces a lesson from the CRITIC and Reflexion logs: external verification matters more than internal reasoning. WorkArena tasks succeed or fail based on system state, and that clean ground truth is what makes the benchmark honest. For Beancount write-back agents, this argues strongly for a design where every committed ledger change is verified against the beancount Python API before being accepted, not just checked by the agent's own reasoning. The 42.7% ceiling on the best model at ICML 2024 suggests that even for conventional enterprise UI tasks, the gap from "occasionally useful" to "reliably automatable" is still large.
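Concretely, the write-back loop this argues for is: propose an entry, verify it against the data's own invariants, and only then commit. A self-contained sketch using a toy double-entry check as a stand-in for re-parsing with beancount's actual loader:

```python
# Sketch of external verification for a write-back agent: the change is
# checked against the ledger's invariants, not the agent's own reasoning.
# check_transaction is a toy stand-in for beancount's real validation.
from decimal import Decimal

ROOT_ACCOUNTS = {"Assets", "Liabilities", "Equity", "Income", "Expenses"}

def check_transaction(postings: list[tuple[str, str]]) -> list[str]:
    """Return a list of errors; empty means the entry is acceptable."""
    errors = []
    total = sum(Decimal(amount) for _, amount in postings)
    if total != 0:
        errors.append(f"postings do not balance (sum = {total})")
    for account, _ in postings:
        if account.split(":")[0] not in ROOT_ACCOUNTS:
            errors.append(f"unknown root account: {account}")
    return errors

def commit_if_valid(ledger: list, postings: list[tuple[str, str]]) -> bool:
    """Append the entry only when verification from the data passes."""
    if check_transaction(postings):
        return False
    ledger.append(postings)
    return True

ledger = []
ok = commit_if_valid(ledger, [("Expenses:Groceries", "42.10"),
                              ("Assets:Checking", "-42.10")])
bad = commit_if_valid(ledger, [("Expenses:Groceries", "42.10"),
                               ("Assets:Checking", "-40.00")])
assert ok and not bad and len(ledger) == 1
```

The point of the gate is that an agent's claim of success never reaches the ledger; only entries that survive the data-level check do, which is the WorkArena-style ground truth transplanted into the accounting setting.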
What to read next
- WorkArena++ (arXiv:2407.05291, NeurIPS 2024) — the follow-up from the same ServiceNow team with 682 compositional tasks requiring planning, arithmetic reasoning, and multi-document retrieval; directly answers whether scaling task complexity exposes new failure modes beyond the UI-interaction wall.
- WebArena (arXiv:2307.13854, ICLR 2024) — the companion general-purpose web agent benchmark (812 tasks across e-commerce, forums, code hosting, CMS) where GPT-4 achieves only 14.41% versus 78% human performance; places the WorkArena numbers in the broader web-agent landscape.
- OSWorld (arXiv:2404.07972, NeurIPS 2024) — extends the enterprise-automation evaluation to full desktop computer environments including real applications (LibreOffice, VS Code, Chrome); the most comprehensive test of whether the WorkArena failure modes are UI-specific or reflect a deeper agent competence gap.
