InvestorBench (ACL 2025) tests 13 LLM backbones on backtested stock, crypto, and ETF trading using cumulative return (CR) and Sharpe ratio — not QA accuracy. Qwen2.5-72B tops the stock leaderboard at 46.15% CR; finance-tuned models backfire on equities. Model size predicts performance more reliably than domain fine-tuning.
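Both headline metrics are easy to compute from a daily return series. A minimal sketch — the 252-trading-day annualization and zero risk-free rate are common conventions assumed here, not InvestorBench's documented protocol:

```python
import math

def cumulative_return(daily_returns):
    """Compound a series of simple daily returns into one total return."""
    growth = 1.0
    for r in daily_returns:
        growth *= 1.0 + r
    return growth - 1.0

def sharpe_ratio(daily_returns, risk_free_daily=0.0, periods_per_year=252):
    """Annualized Sharpe ratio: mean excess return over its volatility."""
    excess = [r - risk_free_daily for r in daily_returns]
    n = len(excess)
    mean = sum(excess) / n
    var = sum((x - mean) ** 2 for x in excess) / (n - 1)  # sample variance
    return (mean / math.sqrt(var)) * math.sqrt(periods_per_year)
```

Two 10% days compound to a 21% cumulative return, not 20% — which is why CR and Sharpe can rank the same strategies differently.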
StructRAG (ICLR 2025) routes each query to a task-appropriate structure type — table, graph, catalogue, algorithm, or chunk — before reasoning, scoring 28 points higher than GraphRAG on the Loong benchmark while running 22× faster, with the DPO-trained router alone accounting for a 15-point accuracy gain.
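StructRAG's router is a DPO-trained LLM; a keyword heuristic can stand in for it to make the routing interface concrete. The five structure names follow the paper, but the rules below are illustrative assumptions, not the learned policy:

```python
STRUCTURES = ("table", "graph", "catalogue", "algorithm", "chunk")

def route(query: str) -> str:
    """Pick a structure type for a query. StructRAG learns this mapping
    with a DPO-trained LLM router; these keyword rules are a toy proxy."""
    q = query.lower()
    if any(k in q for k in ("compare", "versus", "per quarter", "statistic")):
        return "table"       # statistical comparison -> tabular structure
    if any(k in q for k in ("relationship", "who owns", "connected")):
        return "graph"       # relational, multi-hop -> graph structure
    if any(k in q for k in ("list all", "enumerate", "summary of each")):
        return "catalogue"   # long-document coverage -> catalogue
    if any(k in q for k in ("step", "procedure", "how to")):
        return "algorithm"   # procedural -> algorithm structure
    return "chunk"           # default: plain retrieval chunks
```

The point of the design is that structurization happens once per query, before reasoning, so the downstream reader always sees evidence in the form the task needs.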
A 2026 Stanford preprint equalizes thinking-token budgets across five multi-agent architectures and finds single-agent LLMs match or beat multi-agent systems on multi-hop reasoning — with theoretical grounding in the Data Processing Inequality and implications for finance AI agent design.
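The Data Processing Inequality argument can be stated compactly (this is the standard information-theoretic fact, not necessarily the preprint's exact notation): if each agent's message depends on the input $X$ only through the previous agent's message, the pipeline forms a Markov chain and

```latex
% Each hand-off depends on X only through the previous message:
X \to M_1 \to M_2 \to \cdots \to M_k
\quad\Longrightarrow\quad
I(X; M_k) \,\le\, I(X; M_{k-1}) \,\le\, \cdots \,\le\, I(X; M_1)
```

so every hand-off can only preserve or lose information about the input, while a single agent conditions on $X$ directly and faces no such bottleneck.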
M3MAD-Bench stress-tests Multi-Agent Debate across 9 models, 5 domains, and vision-language settings, finding that Collective Delusion causes 65% of failures, adversarial debate cuts accuracy by up to 12.8%, and Self-Consistency typically matches debate accuracy at lower token cost.
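Self-Consistency, the cheaper baseline that typically matches debate, is just majority voting over independent samples. A minimal sketch — the `sample_answer` callable standing in for an LLM call is an assumption:

```python
from collections import Counter

def self_consistency(sample_answer, k=5):
    """Draw k independent answers from the model and return the mode.
    Ties break toward the first-seen answer (Counter insertion order)."""
    answers = [sample_answer() for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]
```

Unlike debate, the k samples never see each other's outputs, so one confidently wrong agent cannot drag the rest toward a shared error — which is exactly the Collective Delusion failure mode.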
AGrail (ACL 2025) introduces a two-LLM cooperative guardrail that adapts its safety checks via test-time adaptation, achieving 0% prompt-injection attack success and 95.6% benign-action preservation on Safe-OS — versus GuardAgent and LLaMA-Guard, which block up to 49.2% of legitimate actions.
ShieldAgent (ICML 2025) replaces LLM-based guardrails with probabilistic rule circuits built on Markov Logic Networks, achieving 90.4% accuracy on agent attacks with 64.7% fewer API calls. The post examines what this means for verifiable safety in financial AI systems.
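A Markov Logic Network attaches a weight to each rule and scores a world in proportion to the exponentiated weighted count of satisfied rules. A toy propositional sketch — the rules, weights, and action schema are invented, and normalizing against an empty baseline collapses the two-class case to a logistic function, far simpler than ShieldAgent's verified rule circuits:

```python
import math

# Each rule: (weight, predicate over an action dict). More satisfied
# weight -> higher probability the action is safe.
RULES = [
    (2.0, lambda a: a["authenticated"]),               # actor is logged in
    (1.5, lambda a: a["amount"] <= a["limit"]),        # within transfer limit
    (3.0, lambda a: not a["modifies_closed_period"]),  # no closed-books edits
]

def safety_probability(action):
    """Log-linear score: P(safe) proportional to exp(sum of weights of
    satisfied rules), normalized against the satisfy-nothing baseline."""
    score = sum(w for w, rule in RULES if rule(action))
    return math.exp(score) / (math.exp(score) + 1.0)
```

The appeal over an LLM judge is that the score is a deterministic function of explicit, auditable rules — the probabilistic machinery handles rule conflicts rather than free-form generation.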
Atlas (JMLR 2023) achieves 42.4% accuracy on Natural Questions with only 64 training examples—beating PaLM 540B by 3 points using 11B parameters—by jointly pre-training a Contriever-based dense retriever with a T5 Fusion-in-Decoder reader. Analysis covers retrieval accuracy limits, 587GB index infrastructure costs, and implications for Beancount ledger QA systems.
Izacard and Grave's FiD architecture independently encodes retrieved passages then fuses them in the decoder, outperforming RAG-Sequence by 4–11 points on NQ and TriviaQA. This post examines the design and its implications for Beancount ledger QA, where multi-entry synthesis across transactions is the norm.
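FiD's core move: encode each (question, passage) pair independently, then let the decoder cross-attend over the concatenation of all encoder outputs, so encoding cost grows linearly in passages and only the decoder sees everything at once. A shape-level sketch with a toy stand-in encoder (real FiD uses T5; the hash-seeded "encoder" below is purely illustrative):

```python
import numpy as np

d_model, seq_len, n_passages = 16, 8, 4

def encode(question: str, passage: str) -> np.ndarray:
    """Stand-in for a T5 encoder: returns (seq_len, d_model) hidden states
    for one 'question: ... context: ...' input. Toy random features."""
    seed = abs(hash((question, passage))) % (2**32)
    return np.random.default_rng(seed).standard_normal((seq_len, d_model))

def fid_fuse(question, passages):
    """Encode passages independently, then concatenate along the sequence
    axis; the decoder cross-attends over this fused memory."""
    states = [encode(question, p) for p in passages]
    return np.concatenate(states, axis=0)  # (n_passages * seq_len, d_model)

memory = fid_fuse("who wrote Hamlet?", [f"passage {i}" for i in range(n_passages)])
```

Because passages never attend to each other in the encoder, adding a passage never changes how the others are encoded — the synthesis across entries happens entirely in the decoder, which is the property that matters for multi-transaction ledger questions.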
GuardAgent (ICML 2025) places a separate LLM agent between a target agent and its environment, verifying every proposed action by generating and running Python code — achieving 98.7% policy enforcement accuracy while preserving 100% task completion, versus 81% accuracy and 29–71% task failure for prompt-embedded safety rules.
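The pattern, reduced to its skeleton: the guard LLM emits a Python predicate for the policy, the harness executes it against each proposed action, and only passing actions reach the environment. A sketch in which the "generated" code is a fixed string — in GuardAgent it comes from the guard LLM, and the policy and action schema here are invented:

```python
# In GuardAgent this source is generated by the guard LLM from the
# natural-language policy; it is hard-coded here for illustration.
GENERATED_CHECK = """
def check(action):
    # Policy: reads are always allowed; writes only under the amount limit.
    if action["type"] == "read":
        return True
    return action["type"] == "write" and action["amount"] < 1000
"""

def verify(action: dict) -> bool:
    """Execute the generated checker in a fresh namespace and run it.
    (A production harness would sandbox this execution.)"""
    namespace: dict = {}
    exec(GENERATED_CHECK, namespace)
    return bool(namespace["check"](action))
```

Running the rule as code rather than re-asking an LLM is what makes enforcement deterministic: the same action against the same policy always gets the same verdict.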
A close reading of Du et al.'s ICML 2024 multiagent debate paper — which reports 14.8-point accuracy gains on arithmetic — alongside 2025 rebuttals showing equal-budget single agents match debate performance, and an analysis of why Collective Delusion (65% of debate failures) poses specific risks for AI-assisted ledger commits.