EnterpriseArena runs 11 LLMs through a 132-month CFO simulation tracking survival, terminal valuation, and book-closing rates. Only Qwen3.5-9B survives 80% of runs; GPT-5.4 and DeepSeek-V3.1 hit 0%. Human experts achieve 100% survival at 5× the terminal value. The critical bottleneck: LLMs skip ledger reconciliation 80% of the time, acting on stale financial state.
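The skipped-reconciliation failure suggests a cheap structural guard: force a re-read of ledger state before any action that depends on it. A minimal sketch, assuming a hypothetical versioned `LedgerState` (the names and fields here are illustrative, not from the benchmark):

```python
from dataclasses import dataclass

@dataclass
class LedgerState:
    cash: float
    version: int  # bumped on every posted transaction

def reconcile(agent_view: LedgerState, ledger: LedgerState) -> LedgerState:
    """Refresh the agent's cached view if it has gone stale."""
    if agent_view.version != ledger.version:
        # The agent's cached state is out of date: re-read before acting.
        return LedgerState(ledger.cash, ledger.version)
    return agent_view

def safe_spend(agent_view: LedgerState, ledger: LedgerState, amount: float) -> LedgerState:
    """Only commit a spend against a freshly reconciled view."""
    view = reconcile(agent_view, ledger)
    if view.cash < amount:
        raise ValueError("insufficient funds after reconciliation")
    return LedgerState(view.cash - amount, view.version + 1)
```

Wrapping every state-dependent tool call in a guard like this turns the 80% skip rate into a non-issue by construction, rather than hoping the model remembers to reconcile.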
WildToolBench (ICLR 2026) evaluates 57 LLMs on 1,024 tasks drawn from real user behavior — no model exceeds 15% session accuracy, with compositional orchestration, hidden intent, and instruction transitions as the three sharpest failure modes.
JSONSchemaBench tests 9,558 real-world JSON schemas against six constrained decoding frameworks and finds that schema complexity causes coverage to collapse from 86% on simple schemas to 3% on complex ones, with XGrammar silently emitting 38 non-compliant outputs and no framework covering all 45 JSON Schema feature categories.
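Since no constrained-decoding framework covers every JSON Schema feature, a post-hoc validation pass is a sensible backstop. The toy validator below checks only `required` and per-property `type` — a sketch of the pattern, not a full JSON Schema implementation (real deployments would use a complete validator such as the `jsonschema` package):

```python
def validate_subset(schema: dict, data: dict) -> list:
    """Toy check for a small JSON Schema subset: 'required' fields and
    per-property 'type'. Returns a list of error strings (empty = pass)."""
    errors = []
    for field in schema.get("required", []):
        if field not in data:
            errors.append(f"missing required field: {field}")
    type_map = {"string": str, "number": (int, float), "integer": int,
                "boolean": bool, "array": list, "object": dict}
    for name, spec in schema.get("properties", {}).items():
        if name in data and "type" in spec:
            if not isinstance(data[name], type_map[spec["type"]]):
                errors.append(f"wrong type for {name}")
    return errors
```

Running even a shallow check like this after generation would have caught XGrammar's silently non-compliant outputs before they reached downstream consumers.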
FinMCP-Bench evaluates six LLMs on 613 real-world financial tool-use tasks backed by 65 MCP servers — the best model scores 3.08% exact match on multi-turn tasks, revealing a 20× performance collapse from single-tool to multi-turn scenarios.
FinTrace benchmarks 13 LLMs on 800 expert-annotated financial task trajectories across 9 metrics, finding that frontier models achieve strong tool selection (F1 ~0.9) but score only 3.23/5 on information utilization — the step where agents reason over what tools return.
FinToolBench pairs 760 live financial API tools with 295 executable queries to test LLM agents on real financial tasks — finding that GPT-4o's conservative 22.7% call rate yields higher answer quality (CSS 0.670) than Qwen3-8B's aggressive 87.1% TIR, while intent mismatch exceeds 50% across all tested models.
OmniEval (EMNLP 2025) benchmarks RAG systems across 5 task types × 16 financial topics using 11.4k auto-generated test cases. The best systems achieve only 36% numerical accuracy — concrete evidence that RAG pipelines need validation layers before writing to structured financial ledgers.
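One concrete shape such a validation layer can take: reject any RAG answer whose figures do not appear verbatim in the retrieved sources. The sketch below (hypothetical function names; a deliberately simple heuristic, not OmniEval's methodology) grounds numbers before they are committed to a ledger:

```python
import re

def numbers_in(text: str) -> set:
    """Extract numeric literals, normalizing thousands separators."""
    return {m.replace(",", "") for m in re.findall(r"\d[\d,]*\.?\d*", text)}

def grounded(answer: str, sources: list) -> bool:
    """Accept an answer only if every figure it cites occurs in some
    retrieved source — a cheap guard given ~36% numerical accuracy."""
    source_nums = set()
    for s in sources:
        source_nums |= numbers_in(s)
    return numbers_in(answer) <= source_nums
```

A check this simple cannot verify that a number is *correctly used*, only that it was not hallucinated outright — but at 36% numerical accuracy, filtering out unsupported figures is a necessary first line of defense.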
A training-free inference-time calibration subtracts positional bias from LLM attention weights, recovering up to 15 percentage points of RAG accuracy when retrieved documents are buried mid-context — a fix directly relevant to finance-specific agent pipelines that stuff long retrieval results into the prompt.
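The core idea can be sketched in a few lines: estimate how much attention each context position receives regardless of content (the "lost-in-the-middle" profile), subtract that prior in log-space, and renormalize. This is an illustration of the concept under assumed inputs, not the paper's exact procedure:

```python
import numpy as np

def debias_attention(attn: np.ndarray, prior: np.ndarray) -> np.ndarray:
    """attn: (queries, positions) attention weights; prior: per-position
    average attention independent of content. Subtract the position-only
    log-prior from the log-weights, then re-softmax."""
    logits = np.log(attn + 1e-12) - np.log(prior + 1e-12)
    expd = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return expd / expd.sum(axis=-1, keepdims=True)
```

If a document's attention mass exactly matches the positional prior (i.e. the model is attending by position alone), debiasing flattens it to uniform — content-driven deviations from the prior are what survive.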
ReDAct runs a small model by default and escalates to an expensive model only when token-level perplexity signals uncertainty, achieving 64% cost savings over GPT-5.2-only while matching or exceeding its accuracy — a directly applicable pattern for Beancount transaction-categorization agents.
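The routing pattern itself is small enough to sketch. Below, `small` and `large` are hypothetical stand-ins returning an answer plus a mean token log-probability; the perplexity threshold is an assumed tuning knob, not a value from the paper:

```python
import math

def cascade(prompt, small, large, ppl_threshold=20.0):
    """Route to the small model first; escalate to the large model only
    when the small model's token-level perplexity signals uncertainty."""
    answer, mean_logprob = small(prompt)
    perplexity = math.exp(-mean_logprob)
    if perplexity <= ppl_threshold:
        return answer, "small"
    answer, _ = large(prompt)
    return answer, "large"
```

For Beancount transaction categorization this maps naturally: the small model handles recurring payees it has seen before (low perplexity), and only genuinely ambiguous transactions pay the large-model price.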
OpenHands is an MIT-licensed, Docker-sandboxed agent platform where CodeAct achieves 26% on SWE-Bench Lite — a sobering benchmark that establishes what AI agents can reliably do today, and why the first productive finance deployments should be tightly scoped rather than autonomous.