EnterpriseArena runs 11 LLMs through a 132-month CFO simulation tracking survival, terminal valuation, and book-closing rates. Only Qwen3.5-9B survives in 80% of runs; GPT-5.4 and DeepSeek-V3.1 hit 0%. Human experts achieve 100% survival at 5× the terminal value. The critical bottleneck: LLMs skip ledger reconciliation 80% of the time, acting on stale financial state.
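The reconciliation step the models skip is mechanically simple: recompute each account's balance from the postings and compare it against the asserted balance before acting. A minimal sketch, using a hypothetical `(account, amount)` data model rather than Beancount's actual API:

```python
from decimal import Decimal

def reconcile(postings, assertions):
    """Check each balance assertion against the running ledger total.

    `postings` is a list of (account, Decimal amount); `assertions` maps
    account -> expected Decimal balance. Returns the accounts whose
    computed balance disagrees -- an agent should refuse to act on those.
    (Hypothetical data model, not Beancount's real loader output.)
    """
    balances = {}
    for account, amount in postings:
        balances[account] = balances.get(account, Decimal("0")) + amount
    return {
        acct: (balances.get(acct, Decimal("0")), expected)
        for acct, expected in assertions.items()
        if balances.get(acct, Decimal("0")) != expected
    }

postings = [
    ("Assets:Checking", Decimal("1000.00")),
    ("Assets:Checking", Decimal("-250.00")),
]
# Computed 750.00 vs asserted 800.00 -> the cached state is stale.
stale = reconcile(postings, {"Assets:Checking": Decimal("800.00")})
```

An agent that gates every write behind a check like this, instead of trusting a remembered balance, avoids exactly the stale-state failure the benchmark measures.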
InvestorBench (ACL 2025) tests 13 LLM backbones on backtested stock, crypto, and ETF trading using cumulative return and Sharpe ratio — not QA accuracy. Qwen2.5-72B tops the stock leaderboard at 46.15% CR; finance-tuned models backfire on equities. Model size predicts performance more reliably than domain fine-tuning.
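For readers unfamiliar with the two metrics, both fall out of the per-period return series. A minimal sketch, assuming simple (non-log) returns and the conventional 252-trading-day annualization, which may differ from InvestorBench's exact scoring:

```python
import math

def cumulative_return(returns):
    """Compound per-period simple returns into a total return (CR)."""
    total = 1.0
    for r in returns:
        total *= 1.0 + r
    return total - 1.0

def sharpe_ratio(returns, risk_free=0.0, periods_per_year=252):
    """Annualized Sharpe: mean excess return over its sample stdev,
    scaled by sqrt(periods per year)."""
    excess = [r - risk_free for r in returns]
    mean = sum(excess) / len(excess)
    var = sum((r - mean) ** 2 for r in excess) / (len(excess) - 1)
    return mean / math.sqrt(var) * math.sqrt(periods_per_year)

daily = [0.01, -0.005, 0.007, 0.002, -0.003]
cr = cumulative_return(daily)
```

CR rewards raw profit; Sharpe penalizes volatility, so a model can lead one leaderboard and trail the other.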
A NeurIPS 2024 Spotlight paper ablates three LLM-based time series forecasting methods — OneFitsAll, Time-LLM, and CALF — and finds that removing the language model improves accuracy in most cases, with up to a 1,383× training speedup. For finance AI applications like Beancount balance prediction, lightweight purpose-built models consistently beat repurposed LLMs.
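The "lightweight purpose-built model" in that ablation can be as small as a linear trend fit. As an illustrative (not the paper's) baseline for month-end balance prediction, ordinary least squares on the time index:

```python
def linear_trend_forecast(history, horizon):
    """Fit y = intercept + slope * t by ordinary least squares on the
    time index, then extrapolate `horizon` steps past the history --
    the kind of tiny baseline the ablations favor over repurposed LLMs."""
    n = len(history)
    mean_x = (n - 1) / 2
    mean_y = sum(history) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(history))
    var = sum((x - mean_x) ** 2 for x in range(n))
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return [intercept + slope * (n + h) for h in range(horizon)]

# e.g. month-end checking balances trending upward (made-up numbers)
balances = [1200.0, 1350.0, 1480.0, 1620.0, 1760.0]
forecast = linear_trend_forecast(balances, 3)
```

A model like this trains in microseconds, which is where speedups on the order of 1,383× over LLM-based forecasters come from.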
FinBen evaluates 15 LLMs across 36 financial datasets at NeurIPS 2024, finding GPT-4 reaches 0.63 Exact Match on numerical QA and 0.54 on stock movement forecasting — near chance. Here is what those numbers mean for building a reliable accounting agent on a Beancount ledger.
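To make the 0.63 figure concrete: Exact Match scores 1 only when the answer matches the gold value, 0 otherwise, then averages. A sketch with numeric normalization (stripping `$`, commas, `%`) -- a common scoring choice, though FinBen's exact protocol may differ:

```python
def numeric_exact_match(prediction, gold, tolerance=0.0):
    """Exact Match for numerical QA: normalize formatting, compare as
    numbers; fall back to string equality for non-numeric answers.
    (Assumed normalization, not FinBen's published scorer.)"""
    def to_number(text):
        cleaned = text.strip().replace(",", "").replace("$", "").rstrip("%")
        try:
            return float(cleaned)
        except ValueError:
            return None

    p, g = to_number(prediction), to_number(gold)
    if p is None or g is None:
        return prediction.strip() == gold.strip()
    return abs(p - g) <= tolerance

def em_score(pairs):
    """Mean Exact Match over (prediction, gold) pairs."""
    return sum(numeric_exact_match(p, g) for p, g in pairs) / len(pairs)
```

Under this metric, 0.63 means GPT-4 gets the number outright wrong on roughly 37% of questions: close-but-wrong amounts score zero, which is the right severity for an agent posting entries to a ledger.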