ReDAct runs a small model by default and escalates to an expensive model only when token-level perplexity signals uncertainty, achieving a 64% cost saving over running GPT-5.2 alone while matching or exceeding its accuracy — a directly applicable pattern for Beancount transaction-categorization agents.
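The escalation gate reduces to a few lines. A minimal sketch, assuming the small model exposes per-token log-probabilities; the threshold value here is an illustrative assumption, not a number from the paper:

```python
import math

# Assumed escalation cutoff; ReDAct's actual threshold is not stated here,
# so tune this on a held-out set of categorization tasks.
PPL_THRESHOLD = 12.0

def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-probability
    of the small model's output tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def route(token_logprobs, threshold=PPL_THRESHOLD):
    """Serve the small model's answer when it is confident;
    otherwise escalate the request to the expensive model."""
    return "small" if perplexity(token_logprobs) < threshold else "large"
```

Confident generations (log-probs near zero) stay on the cheap path; uncertain ones trigger the expensive model.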
OpenHands is an MIT-licensed, Docker-sandboxed agent platform where CodeAct achieves 26% on SWE-Bench Lite — a sobering benchmark that establishes what AI agents can reliably do today, and why the first productive finance deployments should be tightly scoped rather than autonomous.
Fin-RATE benchmarks 17 LLMs on 7,500 expert-curated QA pairs from 2,472 SEC filings, revealing an 18.60% accuracy collapse under longitudinal tracking and a 54-point drop for finance-specialized Fin-R1 on cross-entity tasks — with the retrieval pipeline, not the backbone model, as the binding bottleneck.
FinDER benchmarks RAG on 5,703 real hedge fund analyst queries against S&P 500 10-K filings; E5-Mistral achieves only 25.95% context recall, and abbreviation-heavy queries cost 8.2 precision points — evidence that query normalization, not better embeddings, is the first fix for finance AI pipelines.
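Query normalization is the cheapest intervention here: expand abbreviations before embedding so the query vocabulary matches the filings' long-form wording. A minimal sketch; the abbreviation table is a hypothetical stand-in for a curated domain glossary:

```python
# Hypothetical abbreviation table; a production pipeline would curate
# this from a finance glossary (tickers, filing jargon), since FinDER
# shows abbreviation-heavy queries are where precision is lost.
ABBREV = {
    "fcf": "free cash flow",
    "capex": "capital expenditures",
    "yoy": "year over year",
}

def normalize_query(query: str) -> str:
    """Expand known abbreviations token by token before retrieval."""
    return " ".join(ABBREV.get(tok.lower(), tok) for tok in query.split())
```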
The TACL 2024 paper by Liu et al. shows LLMs perform up to 20 points worse on information buried in the middle of long contexts — a U-shaped degradation affecting every tested model including Claude-1.3-100K — with concrete implications for how RAG pipelines should order retrieved passages in finance and accounting applications.
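One practical mitigation, under the assumption that retrieved passages come ranked best-first, is to interleave them so the strongest evidence sits at both ends of the context and the weakest falls in the middle, where the U-shaped degradation hits hardest:

```python
def order_for_context(passages):
    """Interleave a best-first ranked list so top passages land at the
    start and end of the prompt, weakest passages in the middle."""
    front, back = [], []
    for i, passage in enumerate(passages):  # input assumed sorted best-first
        (front if i % 2 == 0 else back).append(passage)
    return front + back[::-1]
```

For five passages ranked 1 (best) to 5 (worst), this yields the order 1, 3, 5, 4, 2: the two best bracket the context.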
AD-LLM benchmarks GPT-4o and Llama 3.1 8B across three anomaly detection roles — zero-shot detector, data augmenter, and model selector — on five NLP datasets; GPT-4o reaches AUROC 0.93–0.99 zero-shot, but LLM-based model selection remains unreliable, with direct implications for financial audit AI.
CausalTAD improves LLM-based tabular anomaly detection by reordering table columns to respect causal dependencies before serialization, lifting average AUC-ROC from 0.803 to 0.834 over AnoLLM on mixed-type benchmarks — with direct implications for detecting anomalies in structured ledger data.
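The reordering idea amounts to serializing columns causes-first, so each value is conditioned on its causes when the LLM scores the row left-to-right. A minimal sketch; the column set and its causal order are illustrative assumptions for a ledger-like table, not CausalTAD's discovered graph:

```python
# Assumed causal order for a ledger-like table: causes before effects
# (the date and payee determine the account, which determines the amount).
CAUSAL_ORDER = ["date", "payee", "account", "amount"]

def serialize_row(row: dict, order=CAUSAL_ORDER) -> str:
    """Serialize columns in causal order before handing the row
    to the LLM for density scoring."""
    return ", ".join(f"{col} is {row[col]}" for col in order)
```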
AnoLLM (ICLR 2025) reformulates tabular anomaly detection as LLM density estimation — fine-tuning on normal rows and scoring by negative log-likelihood. It outperforms classical methods on mixed-type fraud datasets but offers no edge on purely numerical data, with real implications for detecting anomalies in Beancount ledger entries.
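The recipe (fit on normal rows only, score by length-normalized negative log-likelihood) can be illustrated with a toy per-column frequency model standing in for the fine-tuned LLM:

```python
import math
from collections import Counter

class ToyDensityScorer:
    """Toy stand-in for AnoLLM's fine-tuned LLM: independent per-column
    categorical frequencies learned from normal rows only."""

    def fit(self, rows):
        self.counts = [Counter(col) for col in zip(*rows)]
        self.n = len(rows)
        return self

    def score(self, row):
        """Length-normalized negative log-likelihood (Laplace-smoothed);
        higher means less like the normal data, i.e. more anomalous."""
        return -sum(
            math.log((c[v] + 1) / (self.n + len(c) + 1))
            for c, v in zip(self.counts, row)
        ) / len(row)
```

A row with values never seen during fitting scores strictly higher than one drawn from the training distribution, which is the property the NLL formulation buys.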
The LLMFinLiteracy benchmark finds that five open-weight ~7B models generate fully correct Beancount transactions only 2.3% of the time, with failures concentrated in accounting reasoning—not syntax—pointing to compiler-in-the-loop feedback as the critical missing ingredient for reliable write-back agents.
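The compiler-in-the-loop pattern is a regenerate-until-valid loop. A minimal sketch in which `validate` is a toy stand-in for shelling out to `bean-check` (a real loop would parse its error output) and `llm_draft` is a hypothetical LLM call that receives the previous round's errors:

```python
import re

# First line of a Beancount transaction: date, flag, then payee string.
DATE_RE = re.compile(r'^\d{4}-\d{2}-\d{2} [*!] ')

def validate(entry: str) -> list:
    """Toy checks standing in for bean-check: header format and
    balanced postings."""
    errors = []
    lines = entry.strip().splitlines()
    if not DATE_RE.match(lines[0]):
        errors.append('first line must be \'YYYY-MM-DD * "Payee"\'')
    amounts = [float(m.group(1)) for line in lines[1:]
               if (m := re.search(r"(-?\d+\.\d{2}) USD$", line))]
    if round(sum(amounts), 2) != 0.0:
        errors.append("postings do not balance")
    return errors

def write_back(llm_draft, max_retries=3):
    """Regenerate until the checker passes, feeding errors back to the LLM."""
    feedback = []
    for _ in range(max_retries):
        entry = llm_draft(feedback)  # hypothetical LLM call
        feedback = validate(entry)
        if not feedback:
            return entry
    raise ValueError(f"gave up after {max_retries} attempts: {feedback}")
```

Because the benchmark localizes failures in accounting reasoning rather than syntax, the checker's error messages (unbalanced postings, wrong accounts) are exactly the feedback the model lacks on a single pass.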