FinQA (EMNLP 2021) built 8,281 QA pairs from S&P 500 earnings reports requiring multi-step arithmetic programs. Neural models scored 61% at release versus 91% for human experts; accuracy collapses to 22% on three-or-more-step programs. The failure modes — domain constants, cross-modality grounding, chain length — map directly to the challenges Beancount agents face today.
DSPy replaces hand-crafted prompt strings with declarative signatures and a metric-driven compiler—boosting Llama2-13b from 9.4% to 46.9% on GSM8K math reasoning and offering a more maintainable path for production finance AI pipelines.
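The declarative idea can be sketched without the library: a signature names typed input/output fields instead of a prompt string, and a compiler-like optimizer selects few-shot demonstrations by a metric. A minimal pure-Python sketch of that loop (names like `Signature`, `render_prompt`, and `compile_prompt` are illustrative, not the DSPy API):

```python
from dataclasses import dataclass

@dataclass
class Signature:
    """Declarative task spec: named input/output fields, not a prompt string."""
    inputs: tuple
    outputs: tuple
    instruction: str

def render_prompt(sig, demos, **kwargs):
    """Turn a signature plus selected demos into a concrete prompt."""
    lines = [sig.instruction]
    for demo in demos:                       # few-shot examples
        for field in sig.inputs + sig.outputs:
            lines.append(f"{field}: {demo[field]}")
    for field in sig.inputs:                 # the live query
        lines.append(f"{field}: {kwargs[field]}")
    lines.append(f"{sig.outputs[0]}:")       # cue the model to answer
    return "\n".join(lines)

def compile_prompt(sig, candidate_demo_sets, trainset, metric, lm):
    """'Compile': keep whichever demo set maximizes the metric on a small trainset."""
    def score(demos):
        return sum(metric(lm(render_prompt(sig, demos, **ex)), ex) for ex in trainset)
    return max(candidate_demo_sets, key=score)
```

The point of the pattern is that prompt text becomes a compiled artifact: swapping models or metrics means recompiling, not hand-editing strings.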
Self-RAG (ICLR 2024 Oral) trains a language model to decide when to retrieve and then grade its own results using four reflection tokens — reaching 55.8% on PopQA and 80.2 FactScore on biographies while outperforming ChatGPT on five benchmarks. Analysis covers the mechanism, ablation results, reproducibility limits, and implications for finance AI agents over Beancount ledgers.
HippoRAG (NeurIPS 2024) builds a knowledge graph from OpenIE triples and applies Personalized PageRank at query time, reaching 89.1% Recall@5 on 2WikiMultiHopQA versus 68.2% for ColBERTv2—with direct implications for querying complex financial ledgers across multi-year transaction histories.
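The retrieval step can be approximated with a small power-iteration Personalized PageRank over an entity graph, restarting at the entities mentioned in the query; passages are then ranked by the scores of the entities they contain. A pure-Python sketch (HippoRAG itself builds the graph from OpenIE triples plus synonym edges; entity names here are made up):

```python
def personalized_pagerank(edges, seeds, alpha=0.85, iters=50):
    """Power iteration with restart mass concentrated on the query's seed entities."""
    nodes = sorted({n for edge in edges for n in edge})
    neighbors = {n: [] for n in nodes}
    for a, b in edges:                      # undirected entity graph
        neighbors[a].append(b)
        neighbors[b].append(a)
    restart = {n: (1 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(iters):
        nxt = {n: (1 - alpha) * restart[n] for n in nodes}
        for n in nodes:
            share = alpha * rank[n] / len(neighbors[n])
            for m in neighbors[n]:          # spread rank mass along edges
                nxt[m] += share
        rank = nxt
    return rank
```

Multi-hop behavior falls out of the restart vector: a node bridging two seed entities accumulates mass from both, which is why entities linking, say, a company to a line item across years outrank unrelated neighbors.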
Bloomberg trained a 50B-parameter LLM on 569B tokens of financial data and beat general models on sentiment and table-reasoning benchmarks — then GPT-4 matched it without any finance-specific pretraining. What the $10M experiment reveals about domain pretraining trade-offs, tokenization of numbers, and why tool-use is more reliable than model internals for accounting agents.
AutoGen (Wu et al., 2023) introduces a multi-agent conversation framework where LLM-backed agents pass messages to complete tasks; a two-agent setup lifts MATH benchmark accuracy from 55% to 69%, and a dedicated SafeGuard agent improves unsafe-code detection by up to 35 F1 points — findings directly applicable to building safe, modular Beancount automation pipelines.
MemGPT applies OS-style virtual memory paging to LLMs, using three-tier storage — working memory, recall, and archival — to give agents persistent recall across sessions; on multi-session chat benchmarks, MemGPT with GPT-4 achieves 92.5% accuracy versus a 32.1% fixed-context baseline.
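Stripped to its core, the paging idea is evicting overflow from a bounded working context into an unbounded archival store and fetching it back on demand. A toy sketch (class and method names are illustrative; real recall would use embedding search rather than substring match):

```python
from collections import deque

class TieredMemory:
    """Bounded working context; evicted messages page out to archival storage."""
    def __init__(self, capacity):
        self.working = deque()   # must fit in the LLM context window
        self.archival = []       # unbounded external store
        self.capacity = capacity

    def append(self, message):
        self.working.append(message)
        while len(self.working) > self.capacity:
            self.archival.append(self.working.popleft())  # page out the oldest

    def recall(self, query):
        """Page matching archival entries back into scope (naive substring search)."""
        return [m for m in self.archival if query in m]
```

The agent-facing trick in MemGPT is that these page-in/page-out operations are exposed to the model as tool calls, so the LLM itself decides what to archive and when to recall.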
Huang et al. (ICLR 2024) show that prompting LLMs to review their own reasoning without external feedback consistently degrades accuracy — GPT-4 drops from 95.5% to 91.5% on GSM8K — with direct implications for designing reliable Beancount journal-entry agents.
CRITIC (ICLR 2024) achieves 7.7 F1 gains on open-domain QA and a 79.2% toxicity reduction by grounding LLM revision in external tool signals — a verify-then-correct loop that maps directly onto write-back safety for Beancount finance agents.
Self-consistency replaces greedy chain-of-thought decoding with majority voting over N sampled reasoning paths, raising GPT-3's GSM8K accuracy by 17.9 percentage points with no additional training, and applies directly to multi-step financial calculations where single-sample decoding is unreliable.
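The voting step is a few lines once the model can be sampled at nonzero temperature; a sketch assuming a `sample` callable that runs one chain-of-thought rollout and returns only its final answer:

```python
from collections import Counter

def self_consistency(sample, n=20):
    """Sample n reasoning paths and return the most common final answer.

    The vote share doubles as a rough confidence proxy: for a ledger agent,
    a low share is a cue to flag the calculation for human review.
    """
    answers = [sample() for _ in range(n)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n
```

Note that only the final answers are aggregated; the sampled reasoning chains themselves may disagree, which is precisely what makes the marginalization work.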