WildToolBench: Why No LLM Exceeds 15% Session Accuracy in Real-World Tool Use
The tool-use benchmarks I've been tracking — BFCL, ToolBench, τ-bench — all share a common design flaw: they construct tasks from the benchmark authors' imagination of what users do. WildToolBench, accepted at ICLR 2026, goes back to real user logs and asks what users actually do. The answer is humbling: 57 LLMs evaluated, zero exceed 15% session accuracy.
The paper
Peijie Yu, Wei Liu, Yifan Yang, and colleagues at Alibaba present WildToolBench (arXiv:2604.06185), a benchmark of 256 multi-turn dialogue scenarios with 1,024 tasks drawn from authentic user behavior patterns and grounded in ~1,600 public APIs. The core argument is that existing benchmarks are saturating not because the models are good, but because the tasks are artificial. Real users bundle requests together, omit context they shared two turns ago, and switch between asking a tool question, making small talk, and requesting a clarification — sometimes within a single message. WildToolBench operationalizes these failure modes into three structured challenge categories and measures both task-level accuracy and the much stricter session-level accuracy, which requires succeeding at all four tasks in a dialogue.
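The gap between the two metrics follows directly from their definitions. A minimal sketch of the scoring logic, with hypothetical per-task results (this is an illustration of the metric, not the paper's actual harness):

```python
def task_accuracy(sessions):
    """Fraction of individual tasks passed, pooled across all sessions."""
    results = [ok for session in sessions for ok in session]
    return sum(results) / len(results)

def session_accuracy(sessions):
    """Fraction of sessions in which *every* task passed."""
    return sum(all(session) for session in sessions) / len(sessions)

# Hypothetical pass/fail results for three four-task sessions.
sessions = [
    [True, True, True, False],
    [True, True, True, True],
    [True, False, True, True],
]
print(task_accuracy(sessions))     # 10/12 ≈ 0.833
print(session_accuracy(sessions))  # 1/3  ≈ 0.333

# Under an independence approximation, per-task accuracy p gives
# session accuracy of roughly p**4 for a four-task session:
print(round(0.6 ** 4, 4))  # 0.1296 — i.e. under 15%
```

The independence approximation is rough (errors within a session are correlated), but it explains why a respectable task-level score compresses to a single-digit session-level one.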
Key ideas
- Session accuracy collapses to single digits for most models: Gemini-2.0-Flash-Thinking leads at 14.45% session accuracy, Claude-4-Sonnet at 12.50%, GPT-4o at 11.72%. Passing all four tasks in a session is hard enough that even 60% task accuracy translates to under 15% session accuracy — a compound-probability tax on every interaction.
- Compositional orchestration is the sharpest cliff: Mixed sequential-plus-parallel tool topologies cap top models at 25% task accuracy, versus 54–62% for purely parallel or sequential chains. When a task requires a parallel fan-out followed by a sequential merge, the coordination problem exceeds what any current model handles reliably.
- Hidden intent is a bigger gap than anyone measured before: WildToolBench ensures 100% of tasks involve implicit or cross-turn information; BFCL v3 manages only 15.7%. Long-range dependency tasks — where the missing information is more than two turns back — are the hardest sub-type, with no model breaking 50% even at the task level.
- Instruction transitions compound errors at a linear rate: Each additional policy switch (tool task → chat → clarification → tool task) drops accuracy by roughly 5–15 percentage points. At three transitions, the worst-affected models lose 30 points. The authors call this "self-conditioning": prior responses bias the model's interpretation of subsequent instructions in ways that are difficult to correct mid-session.
- Optimal Path Rate stays below 43%: Even when models complete tasks correctly, they burn excess API calls. Claude-4-Sonnet achieves the best Optimal Path Rate at 42.74%, meaning the majority of correct completions take more steps than necessary — a direct cost in latency and tokens for any production system.
- Specialized tool-use models underperform general frontier models: xLAM-2-70B and ToolACE2-8B both post wrong-function-name error rates exceeding 30%, worse than GPT-4o or Claude-4-Sonnet. Fine-tuning on narrow tool-use corpora appears to create brittleness rather than robustness under distribution shift to wild user behavior.
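The mixed topology behind the 25% ceiling, a parallel fan-out followed by a sequential merge, is easy to make concrete. The sketch below uses hypothetical tools (the names, prices, and rate are invented); the point is the coordination structure a model must plan: issue two independent calls concurrently, then issue a third call that consumes both results.

```python
import asyncio

# Hypothetical tools; WildToolBench tasks draw on ~1,600 public APIs.
async def get_price(symbol: str) -> float:
    await asyncio.sleep(0)  # stand-in for a network call
    return {"AAPL": 190.0, "MSFT": 410.0}[symbol]

async def convert_currency(amount: float, to: str) -> float:
    await asyncio.sleep(0)
    return amount * 0.92    # pretend USD -> EUR rate

async def mixed_topology() -> float:
    # Step 1: parallel fan-out — two independent tool calls at once.
    aapl, msft = await asyncio.gather(get_price("AAPL"), get_price("MSFT"))
    # Step 2: sequential merge — a call that depends on both results.
    return await convert_currency(aapl + msft, to="EUR")

print(asyncio.run(mixed_topology()))  # 552.0
```

A model that serializes the fan-out still gets the right answer at extra cost (an Optimal Path failure); a model that merges before both branches return gets a wrong answer outright. The benchmark's gap between these topologies suggests models handle each primitive but not their composition.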
What holds up — and what doesn't
The benchmark design is strong where it matters most. The distinction between task accuracy and session accuracy is exactly right: compounding failure modes is what kills real deployments, and most prior work reports task-level numbers that mask this. The three-challenge taxonomy (compositional orchestration, hidden intent, instruction transitions) is well-motivated and empirically substantiated — the performance degradation curves across challenge types are real and striking.
The weak spot is scale. 1,024 tasks from 256 scenarios is a credible research artifact but thin for a leaderboard intended to track 57 models over time. The authors acknowledge this directly and mention an automated scaling pipeline in future work. The other issue is that "grounded in real user logs" is doing a lot of work: the final tasks are partially synthetic, constructed by a multi-agent system from seed patterns, then verified by human annotators. The claim is grounded but the data is not verbatim wild — it is wild-inspired. That matters for how literally you interpret the 15% ceiling; some fraction of the gap might close if the generation pipeline introduces artificial difficulty that real users don't actually exhibit.
I'm also skeptical of the instruction-transition analysis as an architectural claim. The paper attributes the degradation to a fundamental limitation, but a training-distribution mismatch is the more parsimonious explanation: RLHF fine-tuning rarely exposes models to sessions that interleave tool calls, small talk, and clarification requests. That is addressable with data, not structural.
Why this matters for finance AI
The three failure modes map almost perfectly onto how real users interact with a Beancount write-back agent. A user asks "how much did I spend on groceries last month, and while you're at it add today's Whole Foods receipt" — that is a compositional task bundled into one turn. They follow it with "actually make it $47.23 not $42, I looked it up" — that is a parameter correction requiring the agent to track session state. Then they ask "is that category right?" — that is a clarification request, and the agent needs to not re-execute the write operation it just finished. The 25% cap on mixed sequential-plus-parallel orchestration and the 30-point drop from instruction transitions are exactly the failure modes that would manifest in a ledger agent fielding real user sessions.
The finding that specialized tool-use models underperform general frontier models is particularly relevant. If we were considering fine-tuning a smaller open model on Beancount-specific tool-calling examples — the obvious cost-reduction play — WildToolBench is a direct warning that specialization may sacrifice robustness to the distribution of actual user behavior. The Optimal Path Rate finding matters too: an agent that uses twice as many API calls to complete a task is not just inefficient; for write-back operations, redundant intermediate calls can leave the ledger in inconsistent intermediate states.
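One standard mitigation for the redundant-call risk is to make write operations idempotent, so a retried or duplicated tool call cannot double-post a transaction. A minimal sketch, assuming a toy in-memory ledger and a content-derived key (this is illustrative, not Beancount's API; a production agent might supply explicit idempotency keys instead):

```python
import hashlib

class LedgerWriter:
    """Toy write-back layer that deduplicates tool calls by idempotency key."""

    def __init__(self):
        self.entries = []   # the "ledger"
        self._seen = {}     # idempotency key -> entry index

    def add_transaction(self, payee: str, amount: float, date: str) -> int:
        # Derive a key from the call's semantic content.
        key = hashlib.sha256(f"{date}|{payee}|{amount}".encode()).hexdigest()
        if key in self._seen:
            return self._seen[key]  # duplicate call: no second write
        self.entries.append({"payee": payee, "amount": amount, "date": date})
        self._seen[key] = len(self.entries) - 1
        return self._seen[key]

ledger = LedgerWriter()
ledger.add_transaction("Whole Foods", 42.00, "2026-04-08")
ledger.add_transaction("Whole Foods", 42.00, "2026-04-08")  # retry: ignored
ledger.add_transaction("Whole Foods", 47.23, "2026-04-08")  # new key: written
print(len(ledger.entries))  # 2
```

Note the limits: idempotency absorbs exact retries, but a parameter correction ("make it $47.23 not $42") produces a new key and a new entry, so handling corrections as amendments still requires the session-state tracking the benchmark shows models struggle with.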
What to read next
- ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs (arXiv:2307.16789, ICLR 2024) — the foundational training framework WildToolBench explicitly positions against; understanding its synthetic evaluation design clarifies exactly what live execution adds.
- τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains (arXiv:2406.12045) — the closest prior work on realistic multi-turn tool use; comparing τ-bench's retail/airline domains against WildToolBench's public API coverage shows how much the challenge generalizes.
- AFlow: Automating Agentic Workflow Generation (arXiv:2410.10762, ICLR 2025 oral) — if the instruction-transition problem is addressable by automatically discovering better agent workflows rather than scaling training data, AFlow is the most credible mechanism for doing so.
