TableMaster is a prompting-only pipeline that reaches 78.13% on WikiTQ with GPT-4o-mini, 13 points above Chain-of-Table, by combining table-of-focus extraction, semantic verbalization, and adaptive switching between text and symbolic reasoning. Here is what that architecture means for AI agents working over plain-text financial ledgers such as those kept in Beancount.
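A minimal sketch of what that three-stage recipe could look like in code, assuming a generic `llm(prompt) -> str` completion callable; the prompts and the inline executor are illustrative stand-ins, not TableMaster's actual templates or implementation.

```python
# Illustrative TableMaster-style pipeline; `llm` is any prompt -> completion
# callable, and the prompts are simplified stand-ins for the paper's templates.
import io
import contextlib
from typing import Callable

def _run_python(code: str) -> str:
    """Execute generated code and capture stdout (demo only; sandbox in real use)."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, {})
    return buf.getvalue().strip()

def answer(question: str, table_md: str, llm: Callable[[str], str]) -> str:
    # 1. Table-of-focus extraction: keep only the rows/columns the question needs.
    focus = llm(
        f"Table:\n{table_md}\n\nQuestion: {question}\n"
        "Return only the relevant rows and columns as a markdown table."
    )
    # 2. Semantic verbalization: restate the sub-table as natural-language facts.
    facts = llm(f"Rewrite this table as short factual sentences:\n{focus}")
    # 3. Adaptive reasoning: route to textual or symbolic (program-based) reasoning.
    route = llm(
        f"Question: {question}\nFacts:\n{facts}\n"
        "Reply TEXT if lookup/reading suffices, SYMBOLIC if arithmetic, counting, "
        "or sorting is required."
    ).strip().upper()
    if route.startswith("SYMBOLIC"):
        code = llm(
            f"Write Python that computes the answer to '{question}' from this "
            f"table and prints only the answer:\n{focus}"
        )
        return _run_python(code)
    return llm(f"Facts:\n{facts}\nQuestion: {question}\nAnswer concisely.")
```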
GPT-4 achieves a mean AUROC of 74.1% on the ODDS benchmark without fine-tuning, nearly matching the classical ECOD baseline at 75.5%, but fails on multi-dimensional anomalies and high-variance datasets. A critical review of zero-shot LLM anomaly detection and its implications for automated Beancount ledger auditing.
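Reproducing the classical side of that comparison on a ledger is cheap. A hedged sketch, assuming the `beancount` and `pyod` packages and a hypothetical `ledger.beancount` file; the single feature (log absolute posting amount) is chosen for illustration, not taken from the review.

```python
# Illustrative sketch: score Beancount postings for outliers with the ECOD
# baseline the review benchmarks GPT-4 against. The feature and the file path
# are assumptions made for this example.
import numpy as np
from beancount import loader
from beancount.core.data import Transaction
from pyod.models.ecod import ECOD

entries, errors, _ = loader.load_file("ledger.beancount")  # hypothetical path

postings, amounts = [], []
for entry in entries:
    if isinstance(entry, Transaction):
        for posting in entry.postings:
            if posting.units is not None and posting.units.number is not None:
                postings.append((entry.date, posting.account, posting.units.number))
                amounts.append(float(abs(posting.units.number)))

X = np.log1p(np.array(amounts)).reshape(-1, 1)  # single toy feature
clf = ECOD()
clf.fit(X)

# Highest decision scores = most anomalous postings under this toy feature.
for idx in np.argsort(clf.decision_scores_)[-5:][::-1]:
    date, account, number = postings[idx]
    print(f"{date} {account} {number}  score={clf.decision_scores_[idx]:.3f}")
```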
DocFinQA replaces FinQA's curated 700-word passages with full 123,000-word SEC filings, a 175× increase in context that nearly halves GPT-4 accuracy on long documents. Retrieval pipelines fail to surface the right chunk within the top three results (HR@3) 45% of the time, and long-context models are not a substitute.
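HR@3 here is hit rate at 3: the fraction of questions whose gold evidence chunk lands among the top three retrieved chunks. A minimal sketch of the metric; the `(gold_id, ranked_ids)` input format is an assumption for the example.

```python
# Minimal sketch of Hit Rate@k: the share of questions whose gold evidence
# chunk id appears among the top-k retrieved chunk ids.
from typing import Hashable, Iterable, Sequence

def hit_rate_at_k(results: Iterable[tuple[Hashable, Sequence[Hashable]]], k: int = 3) -> float:
    results = list(results)
    hits = sum(1 for gold, ranked in results if gold in ranked[:k])
    return hits / len(results) if results else 0.0

# Toy example: the gold chunk is in the top 3 for two of three questions -> HR@3 = 0.67.
demo = [
    ("c17", ["c17", "c02", "c88"]),
    ("c05", ["c11", "c05", "c40"]),
    ("c33", ["c90", "c12", "c77"]),
]
print(f"HR@3 = {hit_rate_at_k(demo, k=3):.2f}")
```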
TheAgentCompany tests 175 real workplace tasks across a simulated intranet with GitLab, OwnCloud, and RocketChat. The best model (Gemini-2.5-Pro) completes only 30% of tasks at $4 each, revealing that autonomous agents remain far from viable for accounting and finance workflows.
τ²-bench extends agent benchmarking to dual-control settings where both the AI and the user invoke tools over shared state — finding that active users cut success rates by 18–25 percentage points, with direct implications for Beancount agents sharing write access with human users.
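The dual-control hazard is easy to picture for a shared ledger: the state the agent observed can be stale by the time it writes. A conceptual sketch, not τ²-bench's actual harness, using an in-memory ledger stand-in with an optimistic version check.

```python
# Conceptual sketch of the dual-control problem: the agent and the user both
# hold write access to shared state, so the agent's last observation can be
# stale by the time it acts. Not τ²-bench's actual environment code.
from dataclasses import dataclass, field

@dataclass
class SharedLedger:
    version: int = 0
    entries: list[str] = field(default_factory=list)

    def append(self, entry: str) -> None:
        self.entries.append(entry)
        self.version += 1

def agent_tool_add(ledger: SharedLedger, entry: str, observed_version: int) -> bool:
    # Optimistic concurrency check: refuse to write if the user changed the
    # ledger since the agent last read it, forcing a re-observation instead.
    if ledger.version != observed_version:
        return False  # stale view; the agent must re-read before acting
    ledger.append(entry)
    return True

ledger = SharedLedger()
seen = ledger.version                        # agent observes the ledger
ledger.append('2024-01-05 * "User edit"')    # user writes concurrently
ok = agent_tool_add(ledger, '2024-01-05 * "Agent edit"', seen)
print(ok)  # False: the agent's plan was based on a stale observation
```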
WorkArena++ (NeurIPS 2024) benchmarks 682 compositional enterprise tasks across three difficulty levels. GPT-4o solves 2.1% of them while humans solve 93.9%, isolating exactly why current AI agents fail at implicit-goal knowledge work and why that gap matters for autonomous accounting automation.
GAIA benchmarks 466 real-world tasks across three difficulty levels; frontier agents reached 74.55% in mid-2026 versus 92% for humans, and the remaining Level 3 gap maps directly to the multi-step coordination challenges in automated Beancount ledger workflows.
OSWorld (NeurIPS 2024) benchmarks multimodal AI agents on 369 real desktop tasks across Ubuntu, Windows, and macOS — finding a 60-percentage-point gap between the best model (12.24%) and human performance (72.36%), with 75% of failures traced to visuomotor grounding errors rather than reasoning failures.
GPT-4 completes only 14.41% of WebArena's 812 realistic web tasks while humans reach 78.24%; the dominant failure mode is false infeasibility — conservative refusal to act — with direct implications for any agent operating Fava or finance web UIs.
WorkArena benchmarks LLM web agents on 33 real ServiceNow tasks. GPT-4o reaches 42.7% overall but 0% on list-filter tasks, exposing a sharp divide between form-filling and structured UI interaction that maps directly to challenges in Beancount ledger automation.