
Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinking Token Budgets

· 6 min read
Mike Thrift
Marketing Manager

After spending several log entries on multi-agent debate and guardrail architectures, I wanted to pressure-test the premise: does orchestrating multiple LLMs actually buy us better reasoning, or are we just spending more compute? Dat Tran and Douwe Kiela from Stanford ask exactly that in a preprint posted in April 2026, and the answer is uncomfortable for multi-agent evangelists.

The paper

"Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinking Token Budgets" (arXiv:2604.02460) makes a deceptively simple methodological point: almost all multi-agent benchmarks compare a single agent against a multi-agent system that uses significantly more computation. Once you hold the thinking-token budget constant — matching intermediate reasoning tokens, excluding prompts and final answers — single agents match or beat multi-agent systems on multi-hop reasoning tasks.

The authors frame this with an information-theoretic argument via the Data Processing Inequality. When one agent passes a message to another, the receiving agent works from a processed version of the original context, not the context itself. Information can only be lost or stay the same in that chain — never gained. The DPI therefore predicts that multi-agent decomposition introduces unavoidable communication bottlenecks, and multi-agent systems can only outperform single agents when a single agent's effective context utilization is already degraded.
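The DPI intuition can be made concrete with a toy simulation. This is my own sketch, not the paper's code: a "context" of chained facts, a single agent that answers 2-hop questions from the full context, and a two-agent relay where agent B works only from a lossy summary forwarded by agent A. All names and the fact-chain setup are illustrative assumptions.

```python
import random

random.seed(0)

# Toy context: a chain of facts e0 -> e1 -> ... -> e20.
# A 2-hop question needs both links in the chain to be available.
context = {f"e{i}": f"e{i+1}" for i in range(20)}

def answer_two_hop(facts, start):
    """Answer 'what entity is two hops from start?' if both facts survive."""
    mid = facts.get(start)
    return facts.get(mid) if mid is not None else None

# Single agent: reasons over the full context.
single_correct = sum(answer_two_hop(context, f"e{i}") == f"e{i+2}"
                     for i in range(18))

# Two-agent relay: agent A forwards a lossy "message" (a random half of the
# facts); agent B answers only from that processed version of the context.
summary = dict(random.sample(sorted(context.items()), k=10))
relay_correct = sum(answer_two_hop(summary, f"e{i}") == f"e{i+2}"
                    for i in range(18))

print(single_correct, relay_correct)
```

Whatever half of the facts the summary keeps, the relay can match but never exceed the single agent, which is exactly the DPI's "lost or equal, never gained" claim.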

Key ideas

  • The study controls for "thinking tokens" — intermediate reasoning tokens only — across six token budgets from 100 to 10,000 tokens, using three model families: Qwen3-30B, DeepSeek-R1-Distill-Llama-70B, and Gemini 2.5.
  • Five multi-agent architectures are evaluated: sequential, subtask-parallel, parallel-roles, debate, and ensemble.
  • Benchmarks used are FRAMES (824 challenging multi-hop questions requiring integration from multiple sources) and MuSiQue (4-hop world knowledge questions).
  • Single-agent systems (SAS) achieved the highest or statistically equivalent accuracy in nearly all budget-matched conditions: SAS accuracy ranged from 0.280 to 0.427 across budgets, while comparable multi-agent system (MAS) variants averaged 0.280–0.420.
  • The characteristic failure mode for MAS is over-exploration and drift: agents explore sub-questions without pruning and lose track of the original query. SAS maintains stronger lexical anchoring to the original question.
  • The DPI prediction holds empirically: under heavy context degradation (masking or substitution at α=0.7), multi-agent systems become competitive — but only then.
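The budget-matching rule these results rest on can be sketched in a few lines. This is a hypothetical accounting helper, not the paper's harness: the trace format and function names are my assumptions, but the rule matches the paper's definition (only intermediate reasoning tokens count; prompts and final answers are excluded).

```python
# Only intermediate reasoning ("thinking") tokens count toward the budget;
# prompts and final answers are excluded from the accounting.

def thinking_tokens(trace):
    """Sum reasoning tokens across every agent turn in a run trace."""
    return sum(turn["reasoning_tokens"] for turn in trace)

def budget_matched(sas_trace, mas_trace, budget, tolerance=0.05):
    """A SAS/MAS comparison is fair only if both spend ~the same budget."""
    sas, mas = thinking_tokens(sas_trace), thinking_tokens(mas_trace)
    return all(abs(t - budget) <= tolerance * budget for t in (sas, mas))

# Example: a debate MAS splits the same 2,000-token budget across two agents.
sas_trace = [{"agent": "solo", "reasoning_tokens": 1980}]
mas_trace = [{"agent": "debater_a", "reasoning_tokens": 1010},
             {"agent": "debater_b", "reasoning_tokens": 990}]
print(budget_matched(sas_trace, mas_trace, budget=2000))  # True
```

The point of the check is that an N-agent system gets the same total thinking budget as the single agent, not N times it, which is the condition most prior multi-agent benchmarks silently violated.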

What holds up — and what doesn't

The core methodology is the right move. The field has a reproducibility problem with multi-agent benchmarks precisely because compute is rarely held constant, and the authors' insistence on matched thinking budgets is a genuine contribution. The DPI framing is clean, and the experimental prediction it generates — MAS helps when context utilization breaks down — is verified across three model families, which adds credibility.

That said, several gaps matter. The paper evaluates only text-based multi-hop reasoning. It explicitly excludes tool use, code execution, and vision tasks. That exclusion is significant: most of the production multi-agent systems people actually deploy are not doing pure text QA but are orchestrating tool calls, API lookups, or code interpreters across agents. The DPI argument about message passing between agents is theoretically applicable to these settings, but the empirical claim has not been validated there.

The Gemini token budget control is acknowledged as approximate — the authors developed a special SAS-L variant with structured prompting because Gemini's thinking channel appeared underutilized in standard single-agent mode. That's a confound worth scrutinizing. If thinking-token accounting is unreliable for one of the three model families, the budget-equalization claim becomes harder to interpret.

Two benchmarks is also a thin basis for a general architectural claim. FRAMES has only 824 questions; MuSiQue is a standard benchmark but doesn't cover the full diversity of multi-hop structures. And the paper doesn't address how the single-versus-multi gap changes as model capability scales — the result might be a property of current model sizes rather than a fundamental architectural finding.

Why this matters for finance AI

The connection to Bean Labs is real but needs precision. For a Beancount write-back agent, the architecture I'm most interested in is a writer-verifier pair: one agent generates a ledger entry, another checks it for policy compliance before committing. That's not multi-hop text QA — it's a sequential tool-use pipeline where the verifier is examining a proposed artifact rather than re-processing the same original context. The DPI argument applies loosely: a separate verification agent working from the proposed entry still can't recover facts the writer discarded. But the bottleneck in practice is policy rule recall and arithmetic correctness, not information loss across messages.
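The writer-verifier shape I have in mind can be sketched minimally. This is my own toy, not Bean Labs code or the Beancount API: the verifier sees only the proposed entry, so it can check policy recall and arithmetic, but (per the DPI point above) it cannot recover source facts the writer already discarded.

```python
from decimal import Decimal

def write_entry(source_doc):
    """'Writer' step: propose balanced postings from a parsed source document."""
    amount = Decimal(source_doc["amount"])
    return {
        "date": source_doc["date"],
        "postings": [
            ("Expenses:Meals", amount),
            ("Liabilities:CreditCard", -amount),
        ],
    }

def verify_entry(entry, allowed_accounts):
    """'Verifier' step: policy compliance and arithmetic, on the artifact alone."""
    errors = []
    total = sum(amount for _, amount in entry["postings"])
    if total != 0:
        errors.append(f"postings do not balance: {total}")
    for account, _ in entry["postings"]:
        if account not in allowed_accounts:
            errors.append(f"account not in policy: {account}")
    return errors

doc = {"date": "2026-05-30", "amount": "42.10"}
entry = write_entry(doc)
policy = {"Expenses:Meals", "Liabilities:CreditCard"}
print(verify_entry(entry, policy))  # [] -> safe to commit
```

Note where the verification value actually comes from: balance and policy checks are deterministic properties of the artifact, so this pipeline earns its keep even if a second LLM pass would not.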

Where this paper bears more directly is on the debate architectures considered in earlier logs (Du et al., M3MAD-Bench). If the goal is a debating pair of agents to catch ledger errors, and if both agents together have the same total thinking budget as a single agent with extended reasoning, the evidence here suggests the single-agent approach is more reliable. The finding that MAS is competitive only when context is heavily degraded also matters: for well-structured Beancount entries, where context is clean and well-formed, the single-agent advantage should hold.

The practical lesson is to be suspicious of multi-agent complexity unless you have a specific reason to believe context utilization is the bottleneck. For most ledger QA tasks, it probably isn't.

Further reading

  • Mixture-of-Agents Enhances Large Language Model Capabilities (arXiv:2406.04692) — the paper whose AlpacaEval claims this most directly challenges; worth reading to understand exactly what budget assumptions it made.
  • "Rethinking the Bounds of LLM Reasoning: Are Multi-Agent Discussions the Key?" (arXiv:2402.18272, ACL 2024) — an earlier version of essentially the same finding: single agent with good prompts matches multi-agent discussion; useful for seeing how the critique has evolved.
  • Test-time compute scaling literature (DeepSeek-R1, OpenAI o1 system card) — the broader question is where additional inference compute actually helps, and extended chain-of-thought within a single model may be the more robust answer.