
OSWorld: Desktop AI Agents Succeed on 12% of Tasks Where Humans Succeed on 72%

· 5 min read
Mike Thrift
Marketing Manager

Yesterday I read WebArena, which placed autonomous web agents at roughly 14% success against a 78% human baseline. OSWorld (Xie et al., NeurIPS 2024) asks the same question for the full desktop: Ubuntu, Windows, macOS, real GUI applications. The answer is, if anything, more humbling — and the failure mode is different enough to be interesting on its own.

The paper


OSWorld builds a benchmark of 369 tasks grounded in real desktop applications: LibreOffice, Chrome, VS Code, GIMP, Thunderbird, VLC, and multi-application workflows. Each task comes with a programmatic evaluation script that checks the actual system state after execution — no string-matching heuristics, no LLM-as-judge. The setup uses virtual machines so tasks start from a reproducible state, and it covers all three major operating systems.
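The execution-based idea is worth making concrete. A minimal sketch of the pattern (my own illustrative names, not the paper's actual API): each task ships a "getter" that inspects the real post-execution state inside the VM and a metric that scores that state, so the agent's text output is never judged directly.

```python
# Sketch of execution-based evaluation in the OSWorld style.
# Task, getter, and metric are illustrative names, not the paper's API:
# the evaluator reads actual system state after the agent acts, rather
# than string-matching the agent's output or asking an LLM to judge.
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Task:
    instruction: str
    getter: Callable[[], Any]        # reads real state (file, config, app document)
    metric: Callable[[Any], float]   # 1.0 on success, 0.0 on failure

def evaluate(task: Task) -> float:
    # The agent has already acted inside the VM; we only inspect results.
    state = task.getter()
    return task.metric(state)

# Toy example: "rename the document title to 'Q3 Report'".
doc_state = {"title": "Q3 Report"}   # stand-in for real VM inspection
task = Task(
    instruction="Rename the document title to 'Q3 Report'",
    getter=lambda: doc_state,
    metric=lambda s: 1.0 if s.get("title") == "Q3 Report" else 0.0,
)
score = evaluate(task)
print(score)  # 1.0
```

The point of the design is that a plausible-sounding transcript scores zero unless the system state actually changed.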

The authors test a range of frontier models — GPT-4V, Gemini-Pro-Vision, Claude-3 Opus, Mixtral, CogAgent — across four input configurations: screenshot only, accessibility tree only, screenshot plus accessibility tree, and Set-of-Marks (SoM, where interactive elements are overlaid with numeric labels before the model acts).
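For readers unfamiliar with Set-of-Marks, the mechanics are simple. A hedged sketch (element fields and helper names are my assumptions, not the paper's code): number each interactive element, show the numbered list to the model, and translate the model's chosen label back into pixel coordinates.

```python
# Sketch of the Set-of-Marks (SoM) input configuration: overlay numeric
# labels on interactive elements so the model can say "click element 1"
# instead of emitting raw pixel coordinates. The element dicts and bbox
# format here are assumptions for illustration, not OSWorld's real schema.
def build_som_index(elements):
    """Map numeric labels to elements; return (index, prompt lines for the model)."""
    index, lines = {}, []
    for i, el in enumerate(elements, start=1):
        index[i] = el
        lines.append(f"[{i}] {el['role']}: {el['name']}")
    return index, lines

def resolve_click(label, index):
    """Turn a model-chosen label back into center-of-bbox coordinates."""
    x0, y0, x1, y1 = index[label]["bbox"]
    return ((x0 + x1) // 2, (y0 + y1) // 2)

elements = [
    {"role": "button", "name": "Save", "bbox": (10, 10, 90, 40)},
    {"role": "textbox", "name": "Filename", "bbox": (100, 10, 300, 40)},
]
index, lines = build_som_index(elements)
print(lines[0])                 # [1] button: Save
print(resolve_click(1, index))  # (50, 25)
```

This also explains the configuration's failure mode reported later: a cluttered screen produces hundreds of labels, and the model has to pick the right one out of mostly irrelevant structure.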

Key ideas

  • Humans on unfamiliar tasks succeed 72.36% of the time. The best model at submission time achieves 12.24%. The gap is ~60 percentage points.
  • Screenshot-only performance for the top models (GPT-4V, Gemini-Pro-Vision) sits around 5.26%–5.80%. Adding structured context roughly doubles success, yet the best configuration still fails about 88% of the time.
  • Multi-application workflow tasks are the hardest category at a ceiling of 6.57%, compared to OS/CLI tasks where text-based interfaces make grounding easier.
  • Accessibility tree and Set-of-Marks help, but their benefit is model-dependent: the authors report they can also introduce confusion by overwhelming the model with irrelevant structure.
  • Post-publication progress has been rapid — Agent S (GPT-4o, hierarchical memory) reached 20.58%; RL-based ARPO pushed to 29.9%; Agent S3 (Simular AI, 2025) claims 62.6% in the 100-step setting, approaching human parity. But most of those gains come from better grounding models and RL fine-tuning, not from the base prompted LLMs OSWorld originally tested.
  • Error analysis of 550 failures: over 75% are mouse click inaccuracies — the agent reasons correctly but clicks the wrong pixel. This isn't a reasoning failure; it's a visuomotor grounding failure.
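The grounding-failure claim has a crisp operational form. In this illustrative check (coordinates and bounding boxes invented), the agent's plan names the correct element, but success depends entirely on whether the emitted click lands inside that element's bounding box:

```python
# Illustrative version of the dominant failure mode: the plan targets the
# right element, but the click coordinate falls outside its bounding box.
# All coordinates here are invented for the example.
def click_hits_target(click, bbox):
    """True if the (x, y) click lands inside the target's bounding box."""
    x, y = click
    x0, y0, x1, y1 = bbox
    return x0 <= x <= x1 and y0 <= y <= y1

ok_button = (120, 300, 180, 330)                 # target element's bbox
print(click_hits_target((150, 315), ok_button))  # True: correct grounding
print(click_hits_target((150, 345), ok_button))  # False: right plan, wrong pixel
```

A 15-pixel miss here is invisible to the planner: the reasoning trace looks identical either way, which is why over 75% of failures evade purely textual self-correction.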

What holds up — and what doesn't

The benchmark design is genuinely rigorous. Execution-based evaluation over real VMs with 134 distinct evaluation scripts removes the fuzzy judgment calls that plague many agent benchmarks. That's a significant methodological contribution and it's why the number (12.24%) is credible.

The harder question is what 12.24% actually measures. The task distribution is skewed toward GUI-heavy applications where pixel-precise clicking matters enormously. A Beancount agent that runs entirely in the CLI or emits text files would likely perform much better on this benchmark than an agent doing spreadsheet formatting in LibreOffice. The headline number bundles together very different cognitive demands — spatial motor control, multi-step planning, domain knowledge — and attributing it to a single "agents can't use computers" claim oversimplifies.

The "set-of-marks can mislead some models" finding is interesting but underexplored. The paper notes the variance without fully explaining what kinds of tasks or models are helped vs. hurt. That feels like the most important question for practitioners designing agent UIs, and it gets one paragraph.

I'm also skeptical of how well the 369-task sample covers the long tail of real workflows. The tasks are curated by researchers who necessarily skew toward tasks that are verifiable. Genuinely ambiguous real-world accounting tasks — "clean up these inconsistent merchant names" — are hard to evaluate programmatically and likely underrepresented.

Why this matters for finance AI

The 75%-of-failures-are-grounding-errors finding is directly relevant to Beancount agents, even though Beancount lives at the text layer. The deeper pattern — agents plan correctly but execute incorrectly — maps onto ledger write-back failures where an agent generates the right transaction but writes it to the wrong account or with a transposed date. In both cases, the bottleneck is precise execution, not strategic reasoning.

Multi-app workflow performance (6.57%) is the figure I find most sobering for Bean Labs. Real accounting workflows almost always span multiple applications: a bank CSV export, a Beancount file, a reconciliation spreadsheet, a PDF receipt. If GUI agents struggle catastrophically at multi-app coordination even on curated tasks, a Beancount agent that needs to orchestrate imports, ledger edits, and report generation faces a structurally similar challenge — even in a CLI context where there's no pixel-clicking involved.
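The practical mitigation the multi-app numbers argue for is verifying real state after every step rather than trusting the plan. A minimal sketch, with hypothetical stand-ins for a Beancount import/edit/report pipeline (not a real Beancount API):

```python
# Sketch of step-wise execution verification for a multi-stage ledger
# workflow. The steps are hypothetical stand-ins, not a real Beancount API;
# the pattern is: act, then check actual state before proceeding.
def run_pipeline(steps):
    """Each step is (name, action, verify); stop at the first failed check."""
    for name, action, verify in steps:
        action()
        if not verify():
            return f"failed at: {name}"
    return "ok"

ledger = []  # stand-in for the ledger file on disk
steps = [
    ("import csv",
     lambda: ledger.append('2024-01-05 * "Cafe" ...'),
     lambda: len(ledger) == 1),                       # did the entry land?
    ("date sanity check",
     lambda: None,
     lambda: all(line.startswith("2024") for line in ledger)),
]
result = run_pipeline(steps)
print(result)  # ok
```

The OSWorld error analysis suggests the per-step checks matter more than the planner: each check converts a silent execution error into a recoverable, localized failure.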

The good news from the post-paper trajectory (Agent S3 at 62.6%) is that these aren't fundamental barriers. They're solvable with better grounding models and RL fine-tuning. But that progress required 18 months and significant compute for RL training, which is not the default capability baseline that a Beancount agent can assume from a prompted frontier model.

Further reading

  • AndroidWorld (Rawles et al., arXiv:2405.14573) — extends OSWorld to Android devices with dynamically parameterized tasks, relevant to mobile Beancount interfaces
  • WindowsAgentArena (Bonatti et al., arXiv:2409.08264, ICLR 2025) — adapts OSWorld to Windows with 150+ tasks; independently validates that the gap persists across operating systems
  • Agent S2 (Agashe et al., arXiv:2504.00906) — compositional generalist-specialist architecture that pushes the state of the art significantly; worth understanding the architecture before designing a Beancount multi-step planner