
Can LLM Agents Be CFOs? EnterpriseArena's 132-Month Simulation Reveals a Wide Gap

· 6 min read
Mike Thrift
Marketing Manager

The most ambitious question in finance AI right now is not "can an LLM answer a question about a balance sheet?" but "can an LLM manage a company's money over time without running out of it?" Yi Han et al.'s Can LLM Agents Be CFOs? (arXiv:2603.23638) builds EnterpriseArena to test exactly that, and the answer is: barely, and not in the ways you'd expect.

The paper


EnterpriseArena is a 132-month (11-year) simulation of CFO-level resource allocation. Each timestep represents one month. The agent receives partial observations of firm-level financials, anonymized business documents, and macroeconomic signals drawn from FRED, CBOE, and S&P Global data. It has a budget of 20 tool calls per month spread across four operations — verifying cash position, reviewing financial records, analyzing market conditions, and projecting cash flows — and must choose one of three actions: close the books (reconciliation), request funding (equity or debt, with stochastic outcomes), or pass. The primary constraint is that the company's cash balance must stay non-negative at every timestep; violation ends the episode with a score of zero. Subject to survival, the agent maximizes terminal enterprise valuation under the scoring formula Rev_T × 5 + Cash_T − 5,000 × N_tools, which explicitly penalizes excessive tool use.
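
As a sanity check on the setup, the hard survival constraint and the scoring rule can be sketched in a few lines. This is our own paraphrase of the rules as the paper states them (variable names are ours; the formula is Rev_T × 5 + Cash_T − 5,000 × N_tools):

```python
def score_episode(monthly_cash: list[float],
                  terminal_revenue: float,
                  n_tool_calls: int) -> float:
    """Score one EnterpriseArena episode as described in the paper:
    a negative cash balance at any month ends the episode with 0;
    otherwise the score is Rev_T * 5 + Cash_T - 5000 * N_tools."""
    if any(cash < 0 for cash in monthly_cash):
        return 0.0  # survival constraint violated: episode scores zero
    terminal_cash = monthly_cash[-1]
    return terminal_revenue * 5 + terminal_cash - 5_000 * n_tool_calls
```

Note how the penalty term makes every tool call a real expense: an agent that burns its full 20-call budget every month pays 100,000 per month against its terminal valuation.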

Eleven LLMs were evaluated, including Gemini-3.1-Pro, Claude-Haiku-4.5, GPT-5.4, DeepSeek-V3.1, Llama-3.3-70B, Qwen3.5-397B, and Qwen3.5-9B, alongside a human expert baseline validated by two finance professionals with 8 and 14 years of experience respectively.

Key ideas

  • Survival rates vary wildly across models: Qwen3.5-9B survives 80% of runs, Gemini-3.1-Pro 50%, Claude-Haiku-4.5 and GLM-5 each 20%, and GPT-5.4, DeepSeek-V3.1, Llama-3.3-70B, Mistral-Small-24B, and Mixtral-8x7B each 0%. The overall LLM average is 26%.
  • Larger models do not reliably outperform smaller ones: Qwen3.5-9B (9B parameters, 80% survival, $78.8M terminal valuation) decisively beats Qwen3.5-397B (397B parameters, 20% survival) and GPT-5.4 (0% survival).
  • The gap from humans is large: the human baseline achieves 100% survival and $152.2M ± $29.6M terminal valuation; the LLM average is $28.2M with 26% survival.
  • Book-closing is the critical bottleneck: human experts close the books (reconcile) at 94.3% of timesteps; LLMs average 19.3%. This is the action that produces ground-truth financial statements and enables rational subsequent decisions.
  • Information gathering without action is lethal: Qwen3.5-397B uses market-analysis and forecasting tools at a high rate throughout the simulation but almost never closes books (0.0% book-closing rate) and almost never requests funding, dying from cash exhaustion despite "knowing" what was happening.
  • The tool-budget penalty matters: the scoring formula actively punishes agents that compulsively check rather than act, a constraint that mirrors real opportunity cost.
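
The monthly interaction loop these findings describe (up to 20 tool calls, then one committed action) can be sketched abstractly. The callables standing in for the agent policy and environment below are our own illustrative interface, not the benchmark's API:

```python
TOOLS = ("verify_cash", "review_records", "analyze_market", "forecast_cashflow")
ACTIONS = ("close_books", "request_funding", "pass")

def run_month(pick_tool, pick_action, call_tool, budget: int = 20):
    """One timestep: spend up to `budget` tool calls gathering
    information, then commit to exactly one action.
    `pick_tool(obs)` returns a tool name or None when ready to act;
    `call_tool(tool)` returns a new observation;
    `pick_action(obs)` returns one of ACTIONS."""
    calls = 0
    obs = None
    while calls < budget:
        tool = pick_tool(obs)
        if tool is None:  # agent decides it has seen enough
            break
        assert tool in TOOLS
        obs = call_tool(tool)
        calls += 1
    action = pick_action(obs)
    assert action in ACTIONS
    return action, calls
```

The Qwen3.5-397B failure mode maps onto this loop directly: a policy whose `pick_tool` never returns None exhausts its budget every month and then, at the forced decision point, still refuses to close books or request funding.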

What holds up — and what doesn't

The dual-objective design — survival as a hard constraint plus terminal valuation — is one of the strongest choices in recent agent benchmarking. It reflects how real CFOs actually operate: you cannot optimize growth if you're out of money. The anonymization of calendar dates and company identities prevents models from pattern-matching on memorized historical outcomes, which is a genuine methodological improvement over finance benchmarks that use real tickers and real dates.

The failure-mode taxonomy the authors identify through case studies is credible: GPT-5.4 selects the do-nothing pass action at 99.1% of timesteps, coasting until its cash runs out, while Qwen3.5-397B mistakes analysis for action. These are behaviorally distinct failure modes with different remedies.

What I'm less convinced by: the stochastic macro environment uses Gaussian noise to approximate market shocks, which the authors themselves acknowledge cannot replicate black-swan events or human irrationality. The tool budget of 20 calls per month is also somewhat arbitrary — real CFOs don't face this kind of query-rate constraint on their own memory, which raises the question of whether the benchmark is measuring long-horizon financial judgment or something closer to RAG-under-resource-pressure. The single-agent structure is another explicit limitation the authors name: real CFOs operate within hierarchies of controllers, FP&A analysts, and treasury teams, and the paper does not attempt to simulate this.

The finding that model size doesn't predict survival is striking and probably genuine, but the mechanism isn't well explained. The authors note it without fully unpacking whether it's a failure of instruction-following, long-context coherence, or risk calibration.

Why this matters for finance AI

The book-closing action in EnterpriseArena is essentially the Beancount balance assertion and ledger reconciliation step — the moment when the agent commits to a ground-truth view of financial state before acting. The finding that LLMs skip this 80% of the time maps directly onto the write-back safety problem: an agent that avoids reconciliation before acting is an agent that acts on stale or hallucinated state. For Beancount automation, this suggests that the reconciliation step should be mandatory and verifiable — not optional — in any agent loop.
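
A mandatory, verifiable reconciliation gate might look like the following. This is a minimal sketch under our own assumptions; the class and its hashing scheme are illustrative, not a Beancount API:

```python
import hashlib

class ReconciliationGate:
    """Refuse ledger write-backs unless the agent has reconciled
    against the exact current ledger state. Hashing the ledger text
    makes the check verifiable: any edit since the last reconciliation
    invalidates the gate."""

    def __init__(self):
        self._reconciled_hash: str | None = None

    def mark_reconciled(self, ledger_text: str) -> None:
        """Record that reconciliation ran against this ledger state."""
        self._reconciled_hash = hashlib.sha256(ledger_text.encode()).hexdigest()

    def can_write(self, ledger_text: str) -> bool:
        """True only if the current state matches the reconciled state."""
        if self._reconciled_hash is None:
            return False  # never reconciled: no writes allowed
        return self._reconciled_hash == hashlib.sha256(ledger_text.encode()).hexdigest()
```

The design choice here is that reconciliation is bound to a specific ledger state, not a timestamp, so an agent cannot reconcile once and then keep writing against stale views.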

The 132-month horizon is also directly analogous to multi-year ledger management. The finding that sustained situational awareness degrades over time is the same degradation we'd expect in a Beancount agent managing five years of transaction history: even if the agent has all the data in context, it may not act on it coherently at month 60. This suggests that periodic forced reconciliation checkpoints — not just reactive querying — are necessary in long-running Beancount agent sessions.
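
One way to implement forced checkpoints is to make reconciliation the only legal action once it is overdue. The interval and action names below are illustrative assumptions, not part of the paper or of Beancount:

```python
def allowed_actions(step: int, last_reconcile_step: int,
                    checkpoint_interval: int = 12) -> list[str]:
    """Gate the action space of a long-running ledger agent.
    Once `checkpoint_interval` steps have passed without a
    reconciliation, reconciling becomes the only permitted action."""
    overdue = step - last_reconcile_step >= checkpoint_interval
    if overdue:
        return ["reconcile"]
    return ["reconcile", "write_transaction", "query", "pass"]
```

This inverts the benchmark's observed failure: instead of hoping the agent chooses to close the books 94% of the time like the human baseline, the scheduler makes skipping reconciliation structurally impossible past a bounded horizon.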

The information-gathering trap Qwen3.5-397B falls into is a useful design warning: agents equipped with many retrieval tools may prefer retrieval to commitment, especially when the cost of a wrong action (ledger corruption) is high. Tool-budget constraints of the kind EnterpriseArena uses could help enforce action discipline in Beancount write-back agents.
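
A tool-budget guard in the spirit of EnterpriseArena's 20-call cap could be as simple as the following; the limit and per-call cost mirror the paper's scoring formula, but the class itself is our sketch:

```python
class ToolBudget:
    """Per-period retrieval budget: each call spends one unit, and an
    exhausted budget forces the agent to act on what it already knows.
    The per-call cost echoes the 5,000 * N_tools penalty in the paper."""

    def __init__(self, limit: int = 20, cost_per_call: float = 5_000):
        self.limit = limit
        self.cost_per_call = cost_per_call
        self.used = 0

    def spend(self) -> bool:
        """Consume one call; return False once the cap is hit."""
        if self.used >= self.limit:
            return False
        self.used += 1
        return True

    @property
    def penalty(self) -> float:
        """Total retrieval cost accrued so far."""
        return self.used * self.cost_per_call
```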

Related work

  • EcoGym (arXiv:2602.09514) — complementary long-horizon economy benchmark across Vending, Freelance, and Operation environments over 1,000+ steps; no model dominates across all three, suggesting the failure modes in EnterpriseArena are not idiosyncratic to one benchmark design.
  • AFlow: Automating Agentic Workflow Generation (arXiv:2410.10762, ICLR 2025 oral) — reformulates workflow design as code-space search with MCTS and LLM feedback; if EnterpriseArena shows that manually designed agent behaviors fail, AFlow is the obvious next step for discovering better pipelines automatically.
  • ToolLLM: Facilitating Large Language Models to Master 16,000+ Real-world APIs (arXiv:2307.16789, ICLR 2024) — the foundational tool-use training and evaluation framework; understanding how tool-calling behavior is learned in ToolLLM clarifies whether the action-avoidance failure in EnterpriseArena is a training problem or a prompting problem.