FinRAGBench-V (EMNLP 2025) is the first large-scale benchmark for multimodal RAG with visual citations in finance, covering 112K+ document pages and 1,394 human-annotated QA pairs. Top models achieve only 20–61% block-level citation recall, and multimodal retrieval outperforms text-only by nearly 50 percentage points.
A systematic survey of LLM confidence estimation and calibration methods—white-box logit approaches, consistency-based SelfCheckGPT, and semantic entropy—reveals that verbalized confidence scores from GPT-4 achieve only ~62.7% AUROC, barely above chance, with direct implications for deploying uncertainty-aware agents in finance and accounting.
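The consistency-based idea behind SelfCheckGPT-style methods can be sketched in a few lines: sample the same prompt several times and treat agreement among the sampled answers as the confidence score. This is a minimal illustration, not code from the survey; the function name and sample data are hypothetical.

```python
from collections import Counter

def consistency_confidence(samples):
    """Estimate confidence as the agreement rate among repeated
    stochastic generations for the same prompt: the confidence in
    the modal answer is simply its frequency across the samples."""
    counts = Counter(s.strip().lower() for s in samples)
    answer, freq = counts.most_common(1)[0]
    return answer, freq / len(samples)

# Five hypothetical samples agree 3/5 of the time on "4.2%"
print(consistency_confidence(["4.2%", "4.2%", "4.3%", "4.2%", "4.1%"]))
# -> ('4.2%', 0.6)
```

Unlike verbalized confidence, this estimate costs extra generations but needs no logit access, which is why it remains attractive for closed-weight models.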
FinTrace benchmarks 13 LLMs on 800 expert-annotated financial task trajectories across 9 metrics, finding that frontier models achieve strong tool selection (F1 ~0.9) but score only 3.23/5 on information utilization — the step where agents reason over what tools return.
OmniEval (EMNLP 2025) benchmarks RAG systems across 5 task types × 16 financial topics using 11.4k auto-generated test cases. The best systems achieve only 36% numerical accuracy — concrete evidence that RAG pipelines need validation layers before writing to structured financial ledgers.
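As a sketch of what such a validation layer might look like (the function and account names are hypothetical, not from OmniEval): reject any model-proposed journal entry whose postings fail to balance, routing it to human review instead of writing it to the ledger.

```python
from decimal import Decimal

def validate_entry(postings, tolerance=Decimal("0.005")):
    """Guardrail between a RAG pipeline and a double-entry ledger:
    accept an extracted entry only if its postings sum to zero
    within a rounding tolerance. `postings` is a list of
    (account, amount) pairs with Decimal amounts."""
    total = sum((amt for _, amt in postings), Decimal("0"))
    return abs(total) <= tolerance

entry = [("Assets:Checking", Decimal("-120.00")),
         ("Expenses:Software", Decimal("120.00"))]
print(validate_entry(entry))  # -> True: debits and credits balance
```

Decimal arithmetic matters here: float rounding would make exact balance checks unreliable for currency amounts.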
FinDER benchmarks RAG on 5,703 real hedge fund analyst queries against S&P 500 10-K filings; E5-Mistral achieves only 25.95% context recall, and abbreviation-heavy queries cost 8.2 precision points — evidence that query normalization, not better embeddings, is the first fix for finance AI pipelines.
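A sketch of the query-normalization fix the result points at, using a hypothetical abbreviation table (a production system would curate one from filing glossaries or analyst input): expand shorthand before embedding so the query's surface form matches the 10-K's spelled-out language.

```python
import re

# Illustrative expansion table; entries here are examples, not FinDER's.
ABBREVIATIONS = {
    "yoy": "year over year",
    "capex": "capital expenditures",
    "fcf": "free cash flow",
    "sg&a": "selling, general and administrative expenses",
}

_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in ABBREVIATIONS) + r")\b",
    re.IGNORECASE,
)

def normalize_query(query):
    """Expand finance abbreviations in-place before retrieval."""
    return _PATTERN.sub(lambda m: ABBREVIATIONS[m.group(0).lower()], query)

print(normalize_query("AAPL capex vs FCF YoY"))
# -> AAPL capital expenditures vs free cash flow year over year
```

The point of the FinDER result is that this cheap preprocessing step recovers precision that swapping embedding models does not.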
The TACL 2024 paper by Liu et al. shows LLMs perform up to 20 points worse on information buried in the middle of long contexts — a U-shaped degradation affecting every tested model including Claude-1.3-100K — with concrete implications for how RAG pipelines should order retrieved passages in finance and accounting applications.
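One common mitigation the finding suggests is to reorder retrieved passages so the most relevant sit at the edges of the prompt, letting the least relevant fall into the weak middle. A minimal sketch (the interleaving scheme is a standard trick, not prescribed by the paper):

```python
def order_for_llm(passages_by_relevance):
    """Counter the U-shaped 'lost in the middle' effect: take passages
    ranked most-relevant-first and interleave them so ranks 1, 3, 5...
    fill the front and ranks 2, 4... fill the back, leaving the
    lowest-ranked passages in the middle of the context."""
    front, back = [], []
    for i, passage in enumerate(passages_by_relevance):
        (front if i % 2 == 0 else back).append(passage)
    return front + back[::-1]

# Relevance ranks 1..5 end up as [1, 3, 5, 4, 2]: best at both edges
print(order_for_llm([1, 2, 3, 4, 5]))
```

Because the reordering happens after retrieval scoring, it composes with any retriever without retraining anything.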
AnoLLM (ICLR 2025) reformulates tabular anomaly detection as LLM density estimation — fine-tuning on normal rows and scoring by negative log-likelihood. It outperforms classical methods on mixed-type fraud datasets but offers no edge on purely numerical data, with real implications for detecting anomalies in Beancount ledger entries.
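The scoring recipe can be sketched with a toy character-level unigram model standing in for the fine-tuned LLM; everything below is illustrative, but the shape matches the summary: serialize rows as text, fit a density model on normal rows, and flag rows with high average negative log-likelihood.

```python
import math
from collections import Counter

def train_unigram(normal_rows):
    """Fit a character unigram model on serialized 'normal' ledger
    rows: a toy stand-in for an LLM fine-tuned on normal data."""
    counts = Counter("".join(normal_rows))
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

def anomaly_score(model, row, floor=1e-6):
    """Score a row by its average negative log-likelihood under the
    model; rows unlike the training data receive higher scores."""
    return -sum(math.log(model.get(c, floor)) for c in row) / len(row)

normal = ["2024-01-02 Expenses:Rent 1200.00 USD",
          "2024-02-02 Expenses:Rent 1200.00 USD"]
model = train_unigram(normal)
usual = anomaly_score(model, "2024-03-02 Expenses:Rent 1200.00 USD")
weird = anomaly_score(model, "2024-03-02 Liabilities:??? 9.9e9 XYZ")
print(weird > usual)  # -> True: the off-pattern row scores higher
```

Swapping the unigram for an LLM keeps the interface identical; only the density estimator changes, which is what makes the AnoLLM framing attractive for mixed-type ledger data.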
DocFinQA replaces FinQA's curated 700-word passages with full 123,000-word SEC filings, exposing a 175× context increase that nearly halves GPT-4 accuracy on long documents. Retrieval pipelines fail to surface the right chunk 45% of the time even by hit rate at 3 (HR@3) — and long-context models are not a substitute.
TheAgentCompany tests 175 real workplace tasks across a simulated intranet with GitLab, OwnCloud, and RocketChat. The best model (Gemini-2.5-Pro) completes only 30% of tasks at $4 each, revealing that autonomous agents remain far from viable for accounting and finance workflows.
InvestorBench (ACL 2025) tests 13 LLM backbones on backtested stock, crypto, and ETF trading using cumulative return and Sharpe ratio — not QA accuracy. Qwen2.5-72B tops the stock leaderboard at 46.15% CR; finance-tuned models backfire on equities. Model size predicts performance more reliably than domain fine-tuning.