FinQA (EMNLP 2021) built 8,281 QA pairs from S&P 500 earnings reports requiring multi-step arithmetic programs. Neural models scored 61% at release versus 91% for human experts; accuracy collapses to 22% on three-or-more-step programs. The failure modes — domain constants, cross-modality grounding, chain length — map directly to the challenges Beancount agents face today.
FinanceBench evaluates 16 AI configurations against 10,231 questions from real SEC filings; shared-vector-store RAG answers correctly only 19% of the time, and even GPT-4-Turbo with the oracle passage reaches just 85% accuracy — showing that numerical reasoning, not retrieval, is the binding constraint for enterprise finance AI.
DSPy replaces hand-crafted prompt strings with declarative signatures and a metric-driven compiler—boosting Llama2-13b from 9.4% to 46.9% on GSM8K math reasoning and offering a more maintainable path for production finance AI pipelines.
LATS(Language Agent Tree Search, ICML 2024)는 ReAct, Tree of Thoughts, Reflexion을 단일 MCTS 프레임워크로 통합하여 GPT-4와 함께 HumanEval에서 92.7%의 pass@1을 달성했습니다. Git 기반의 Beancount 장부의 경우, 운영 환경에서 LATS를 제한하는 상태 복원 요구 사항을 아주 쉽게 충족할 수 있습니다.
Self-RAG (ICLR 2024 Oral) trains a language model to decide when to retrieve and then grade its own results using four reflection tokens — reaching 55.8% on PopQA and 80.2 FactScore on biographies while outperforming ChatGPT on five benchmarks. Analysis covers the mechanism, ablation results, reproducibility limits, and implications for finance AI agents over Beancount ledgers.
Voyager, a GPT-4-powered Minecraft agent from NVIDIA and Caltech, demonstrates that a persistent code skill library enables genuine lifelong learning without fine-tuning — discovering 3.3× more items than prior state-of-the-art. The pattern maps directly onto long-horizon Beancount ledger automation, though financial correctness demands staging layers that game sandboxes never require.
HippoRAG (NeurIPS 2024) builds a knowledge graph from OpenIE triples and applies Personalized PageRank at query time, reaching 89.1% Recall@5 on 2WikiMultiHopQA versus 68.2% for ColBERTv2—with direct implications for querying complex financial ledgers across multi-year transaction histories.
AgentBench(Liu 等人,ICLR 2024)在 8 个交互式环境中对 27 个大语言模型进行了基准测试 —— GPT-4 的综合得分为 4.01,而表现最好的开源模型仅为 0.96。三种主要的失败模式(知识图谱失败中 67.9% 为超出任务限制、数据库失败中 53.3% 为格式错误以及无效操作)直接对应了在真实账本上部署 Beancount 回写代理的风险。
Bloomberg trained a 50B-parameter LLM on 569B tokens of financial data and beat general models on sentiment and table-reasoning benchmarks — then GPT-4 matched it without any finance-specific pretraining. What the $10M experiment reveals about domain pretraining trade-offs, tokenization of numbers, and why tool-use is more reliable than model internals for accounting agents.
AutoGen (Wu et al., 2023) introduces a multi-agent conversation framework where LLM-backed agents pass messages to complete tasks; a two-agent setup lifts MATH benchmark accuracy from 55% to 69%, and a dedicated SafeGuard agent improves unsafe-code detection by up to 35 F1 points — findings directly applicable to building safe, modular Beancount automation pipelines.