PHANTOM (NeurIPS 2025) is the first benchmark to measure LLM hallucination detection on real SEC filings across context lengths up to 30,000 tokens. Qwen3-30B-A3B-Thinking leads with F1 = 0.882, while 7B-class models score near chance, a gap with direct implications for autonomous accounting agents.
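PHANTOM's headline number is an F1 score over binary hallucination labels. A minimal sketch of how such a score is computed, for readers calibrating what 0.882 versus near-chance means; this is illustrative only, not the benchmark's own evaluation code:

```python
def f1_binary(preds, golds):
    """F1 for binary hallucination detection (positive = hallucinated).

    preds, golds: sequences of 0/1 labels of equal length.
    """
    tp = sum(1 for p, g in zip(preds, golds) if p and g)        # true positives
    fp = sum(1 for p, g in zip(preds, golds) if p and not g)    # false positives
    fn = sum(1 for p, g in zip(preds, golds) if not p and g)    # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

A classifier that guesses positive at the base rate lands near F1 = 0.5 on a balanced set, which is roughly where the 7B models sit.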
FinMaster (arXiv:2505.13533) benchmarks o3-mini, Claude 3.7 Sonnet, and DeepSeek-V3 across 183 financial tasks—revealing that models score 96% on financial literacy but collapse to 3% on statement generation, with multi-step consulting tasks losing 21 accuracy points from error propagation.
ReAct (Yao et al., ICLR 2023) interleaves chain-of-thought reasoning with tool actions in a single trajectory, outperforming pure CoT on fact verification and beating imitation-learning baselines on embodied tasks by an absolute 34 percentage points in success rate. This analysis covers the paper's failure modes (search-induced distraction and compounding errors) and what they mean for autonomous agents writing back to Beancount ledgers.
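The interleaved Thought/Action/Observation trajectory can be sketched as a simple loop. The `llm` callable, the `tools` dict, and the exact `Action: tool[arg]` / `Finish[answer]` line formats below are assumptions for illustration, not the paper's implementation:

```python
def react_loop(llm, tools, question, max_steps=8):
    """ReAct-style loop: the model alternates free-form reasoning with
    tool calls, and each tool result is appended to the trajectory as
    an Observation before the next reasoning step.

    llm(trajectory) -> str ending in either "Action: tool[arg]"
    or "Finish[answer]" (hypothetical interface).
    tools: dict mapping tool name -> callable(arg) -> str.
    """
    trajectory = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(trajectory)           # e.g. "Thought: ...\nAction: search[ACME]"
        trajectory += step + "\n"
        last = step.splitlines()[-1]
        if last.startswith("Finish["):
            return last[len("Finish["):-1]          # final answer
        if last.startswith("Action: "):
            name, _, rest = last[len("Action: "):].partition("[")
            observation = tools[name](rest.rstrip("]"))
            trajectory += f"Observation: {observation}\n"
    return None  # step budget exhausted without a Finish
```

The failure modes discussed in the analysis live in this loop: a distracting Observation derails every subsequent Thought, and an early wrong Action compounds because later steps condition on the full trajectory.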
A close reading of Toolformer (Meta AI, NeurIPS 2023): how perplexity-filtered self-supervised training teaches a 6.7B-parameter model to call external APIs, where it outperforms GPT-3 175B on arithmetic benchmarks, and why its single-step architecture cannot support the chained tool calls required for structured ledger operations.
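The perplexity-filtering step keeps a candidate API call only when conditioning on the call and its result lowers the model's loss on the following tokens by at least a threshold, compared with the better of (a) no call at all and (b) the call without its result. A minimal sketch of that criterion, assuming a hypothetical `loss_fn(prompt, continuation)` that returns the model's cross-entropy on the continuation:

```python
def keep_api_call(loss_fn, prefix, call, result, continuation, tau=1.0):
    """Toolformer-style filter (sketch): retain the candidate call only if
    seeing the call *with its result* helps predict the continuation by at
    least tau nats versus the best alternative without the result.

    loss_fn(prompt, continuation) -> float  (hypothetical LM loss oracle)
    """
    loss_plain = loss_fn(prefix, continuation)                     # no call
    loss_call = loss_fn(f"{prefix} [{call} ->]", continuation)     # call, no result
    loss_result = loss_fn(f"{prefix} [{call} -> {result}]", continuation)
    return min(loss_plain, loss_call) - loss_result >= tau
```

This is exactly why the architecture is single-step: each candidate call is scored in isolation against its local continuation, so there is no mechanism for one call's output to justify a second, dependent call, which is what chained ledger operations require.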
FinBen evaluates 15 LLMs across 36 financial datasets at NeurIPS 2024, finding GPT-4 reaches 0.63 Exact Match on numerical QA and 0.54 on stock movement forecasting, barely above the coin-flip baseline for a binary task. Here is what those numbers mean for building a reliable accounting agent on a Beancount ledger.