Microsoft's GraphRAG builds a Leiden-partitioned entity graph over a text corpus and precomputes community summaries to answer global sensemaking questions that standard vector RAG cannot handle — but a 2025 bias audit shows its 72–83% win rates collapse after correcting for position and length artifacts in LLM-as-judge evaluation.
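GraphRAG's "global search" path is essentially a map-reduce over the precomputed community summaries: each summary is mapped to a partial answer with a relevance score, and the top-scoring partials are reduced into a final answer. A minimal schematic sketch, where `answer_with` and `keyword_stub` are hypothetical stand-ins for the LLM calls the real pipeline makes at both stages:

```python
# Schematic of GraphRAG-style map-reduce global search: community summaries
# (precomputed from Leiden partitions of the entity graph) each yield a
# scored partial answer, and the best partials are fused.
from typing import Callable

def global_search(
    question: str,
    community_summaries: list[str],
    answer_with: Callable[[str, str], tuple[str, float]],
    top_k: int = 2,
) -> str:
    # Map: one (partial answer, helpfulness score) per community summary.
    partials = [answer_with(question, s) for s in community_summaries]
    # Reduce: keep the top-scoring partials and fuse them.
    best = sorted(partials, key=lambda p: p[1], reverse=True)[:top_k]
    return " ".join(text for text, _ in best)

# Toy stand-in for the LLM: score = crude keyword overlap with the question,
# and the "partial answer" is just the summary itself.
def keyword_stub(question: str, summary: str) -> tuple[str, float]:
    q_words = set(question.lower().split())
    score = len(q_words & set(summary.lower().split())) / len(q_words)
    return summary, score

summaries = [
    "Community A: suppliers and logistics firms in the EU.",
    "Community B: travel expenses cluster around Q3 conferences.",
    "Community C: payroll entities and tax authorities.",
]
print(global_search("what drives travel expenses?", summaries, keyword_stub, top_k=1))
```

The point of the structure is that no single retrieved chunk needs to contain the answer; the reduce step synthesizes across partitions, which is what lets GraphRAG handle corpus-wide sensemaking questions vector RAG misses.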
InvestorBench (ACL 2025) tests 13 LLM backbones on backtested stock, crypto, and ETF trading using cumulative return and Sharpe ratio — not QA accuracy. Qwen2.5-72B tops the stock leaderboard at 46.15% CR; finance-tuned models backfire on equities. Model size predicts performance more reliably than domain fine-tuning.
M3MAD-Bench stress-tests Multi-Agent Debate across 9 models, 5 domains, and vision-language settings, finding that Collective Delusion causes 65% of failures, adversarial debate cuts accuracy by up to 12.8%, and Self-Consistency typically matches debate accuracy at lower token cost.
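The Self-Consistency baseline that debate struggles to beat is just majority voting over independent samples: one model, N forward passes, no multi-round transcript, hence the lower token cost. A minimal sketch with a stubbed sampler standing in for a temperature-sampled LLM:

```python
# Self-Consistency (Wang et al.): draw n independent chain-of-thought
# samples and return the majority answer. Unlike Multi-Agent Debate there
# is no cross-agent transcript, so token cost stays roughly n * one pass.
from collections import Counter
from typing import Callable

def self_consistency(sample_answer: Callable[[], str], n: int = 9) -> str:
    votes = Counter(sample_answer() for _ in range(n))
    return votes.most_common(1)[0][0]

# Stub sampler: answers "42" with 70% probability, otherwise a distractor.
import random
rng = random.Random(42)
def noisy_llm() -> str:
    return "42" if rng.random() < 0.7 else rng.choice(["41", "43"])

print(self_consistency(noisy_llm))  # majority vote usually recovers "42"
```

Because voting suppresses independent per-sample errors while debate can amplify a shared one (the Collective Delusion failure mode above), the cheaper method often matches the expensive one.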
Atlas (JMLR 2023) achieves 42.4% accuracy on Natural Questions with only 64 training examples—beating PaLM 540B by 3 points using 11B parameters—by jointly pre-training a Contriever-based dense retriever with a T5 Fusion-in-Decoder reader. Analysis covers retrieval accuracy limits, 587GB index infrastructure costs, and implications for Beancount ledger QA systems.
Izacard and Grave's FiD architecture independently encodes retrieved passages then fuses them in the decoder, outperforming RAG-Sequence by 4–11 points on NQ and TriviaQA. This post examines the design and its implications for Beancount ledger QA, where multi-entry synthesis across transactions is the norm.
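FiD's key move is architectural: each (question, passage) pair is encoded independently, so encoder cost grows linearly in the number of passages, and the decoder then cross-attends over the concatenation of all encoder outputs, which is where multi-passage fusion happens. A shape-level sketch with a NumPy stand-in for the T5 encoder (toy tokenization and random states, not a real model):

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(question: str, passage: str, d: int = 8) -> np.ndarray:
    """Stand-in for a T5 encoder: returns a (tokens, d) state matrix for
    one 'question + passage' input, encoded independently of the others."""
    n_tokens = len((question + " " + passage).split())
    return rng.standard_normal((n_tokens, d))

question = "What was the total travel expense in 2023?"
passages = [
    "2023-03-01 * Flight to Berlin Expenses:Travel 420 EUR",
    "2023-07-14 * Hotel in Lisbon Expenses:Travel 310 EUR",
    "2023-11-02 * Train to Paris Expenses:Travel 95 EUR",
]

# Independent encoding, then fusion: the decoder sees one long sequence of
# encoder states, so its cross-attention can mix evidence across passages.
encoder_states = [encode(question, p) for p in passages]
fused = np.concatenate(encoder_states, axis=0)

print(fused.shape)  # (total tokens across all question+passage pairs, d)
```

For ledger QA this fusion step is exactly what matters: summing travel expenses requires attending over several transactions at once, which the concatenated decoder input permits and per-passage generation does not.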
A NeurIPS 2024 Spotlight paper ablates three LLM-based time series forecasting methods — OneFitsAll, Time-LLM, and CALF — and finds that removing the language model improves accuracy in most cases, with up to a 1,383× training speedup. For finance AI applications like Beancount balance prediction, lightweight purpose-built models consistently beat repurposed LLMs.
TAT-LLM fine-tunes LLaMA 2 7B with LoRA on financial table-and-text QA benchmarks and reaches 64.60% EM on FinQA — beating GPT-4's 63.91% — by decomposing reasoning into deterministic Extract-Reason-Execute steps that eliminate arithmetic errors.
Empirical comparison of RAG vs. unsupervised fine-tuning across 7B-parameter LLMs shows RAG achieves 0.875+ accuracy on post-cutoff facts while fine-tuning plateaus at 0.504 — with direct implications for Beancount agent design and any system requiring frequent knowledge updates.
Lewis et al.'s NeurIPS 2020 paper introduced the hybrid RAG architecture—a BART-large generator paired with a FAISS-indexed retriever over 21 million Wikipedia passages—achieving 44.5 EM on Natural Questions and establishing the parametric/non-parametric split that now underlies most production AI systems. This review covers RAG-Sequence vs. RAG-Token trade-offs, the retrieval collapse failure mode, and what stale indexes mean for financial AI built on append-only Beancount ledgers.
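The RAG-Sequence vs. RAG-Token trade-off comes down to where the marginalization over retrieved documents happens: once per sequence, or once per generated token. A toy numeric sketch (made-up probabilities, two documents, two target tokens) shows the two formulas give different answers:

```python
import math

# p_doc[z] = retriever probability p(z|x) for retrieved document z;
# p_tok[z][t] = generator probability of the t-th target token given z.
p_doc = [0.6, 0.4]
p_tok = [[0.9, 0.2],   # doc 0
         [0.1, 0.8]]   # doc 1

# RAG-Sequence: one document explains the whole output sequence:
#   p(y|x) = sum_z p(z|x) * prod_t p(y_t | x, z, y_<t)
rag_sequence = sum(pz * math.prod(p_tok[z]) for z, pz in enumerate(p_doc))

# RAG-Token: each token may draw on a different document:
#   p(y|x) = prod_t sum_z p(z|x) * p(y_t | x, z, y_<t)
rag_token = math.prod(
    sum(pz * p_tok[z][t] for z, pz in enumerate(p_doc))
    for t in range(2)
)

print(f"RAG-Sequence: {rag_sequence:.4f}")  # 0.6*0.18 + 0.4*0.08 = 0.1400
print(f"RAG-Token:    {rag_token:.4f}")     # 0.58 * 0.44       = 0.2552
```

Here RAG-Token scores the sequence higher because no single document supports both tokens well, but mixing documents per token does; that flexibility is why RAG-Token suits answers that stitch evidence from multiple passages.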
MultiHiertt (ACL 2022) introduces 10,440 QA pairs from real financial reports averaging 3.89 hierarchical tables each; state-of-the-art models score 38% F1 versus 87% for humans, with a 15-point penalty for cross-table questions — quantifying the retrieval gap finance AI must close.