A 2026 Stanford preprint equalizes thinking-token budgets across five multi-agent architectures and finds single-agent LLMs match or beat multi-agent systems on multi-hop reasoning — with theoretical grounding in the Data Processing Inequality and implications for finance AI agent design.
M3MAD-Bench stress-tests Multi-Agent Debate across 9 models, 5 domains, and vision-language settings, finding that Collective Delusion causes 65% of failures, adversarial debate cuts accuracy by up to 12.8%, and Self-Consistency typically matches debate accuracy at lower token cost.
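The Self-Consistency baseline that matches debate here is simple: sample several independent reasoning chains and majority-vote their final answers, with no inter-agent communication. A minimal sketch, assuming a hypothetical `generate` callable that returns one sampled final answer per call (the function name and signature are illustrative, not from the paper):

```python
from collections import Counter

def self_consistency(generate, prompt, n_samples=5):
    """Sample n_samples independent answers and return the majority vote.

    `generate` is assumed to be a stochastic LLM call (temperature > 0)
    that maps a prompt to a single final-answer string.
    """
    answers = [generate(prompt) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```

The token cost is simply `n_samples` independent forward passes, which is why it undercuts multi-round debate while often matching its accuracy.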
Atlas (JMLR 2023) achieves 42.4% accuracy on Natural Questions with only 64 training examples—beating PaLM 540B by 3 points using 11B parameters—by jointly pre-training a Contriever-based dense retriever with a T5 Fusion-in-Decoder reader. Analysis covers retrieval accuracy limits, 587GB index infrastructure costs, and implications for Beancount ledger QA systems.
A NeurIPS 2024 Spotlight paper ablates three LLM-based time series forecasting methods — OneFitsAll, Time-LLM, and CALF — and finds that removing the language model improves accuracy in most cases, with up to a 1,383× training speedup. For finance AI applications like Beancount balance prediction, lightweight purpose-built models consistently beat repurposed LLMs.
TAT-LLM fine-tunes LLaMA 2 7B with LoRA on financial table-text QA benchmarks and achieves 64.60% EM on FinQA, beating GPT-4's 63.91%, by decomposing reasoning into deterministic Extract-Reason-Execute steps that eliminate arithmetic errors.
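The key to eliminating arithmetic errors is the final Execute step: the model emits a symbolic expression, and plain code, not the LLM, evaluates it. A minimal sketch of such a deterministic evaluator (an illustration of the idea, not TAT-LLM's actual implementation), using a restricted AST walk rather than `eval` for safety:

```python
import ast
import operator

# Whitelisted arithmetic operators; anything else is rejected.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def execute(expr: str) -> float:
    """Deterministically evaluate an LLM-emitted arithmetic expression."""
    def ev(node):
        if isinstance(node, ast.BinOp):
            return OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -ev(node.operand)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError(f"unsupported expression node: {node!r}")
    return ev(ast.parse(expr, mode="eval").body)
```

For example, `execute("(120 - 100) / 100")` returns `0.2`: the extraction and reasoning steps can be wrong, but the arithmetic itself never is.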
IRCoT interleaves BM25 retrieval with each step of a chain-of-thought reasoning loop, achieving +11.3 points retrieval recall and +7.1 F1 on HotpotQA over one-step RAG — and shows a 3B model can beat GPT-3 175B when the retrieval strategy is right.
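The interleaving loop itself is compact: each newly generated reasoning sentence becomes the next retrieval query, so later hops are grounded in evidence the first query could never surface. A minimal sketch, where `generate_step` and `retrieve` are hypothetical stand-ins for an LLM sentence generator and a BM25 retriever (names and stopping heuristic are illustrative, not from the paper):

```python
def ircot(question, generate_step, retrieve, max_steps=4):
    """Interleave CoT generation with retrieval: each new reasoning
    sentence seeds the next BM25 query, expanding the evidence pool."""
    context = retrieve(question)       # initial one-step retrieval
    chain = []
    for _ in range(max_steps):
        sentence = generate_step(question, context, chain)
        chain.append(sentence)
        if "answer is" in sentence.lower():   # illustrative stop condition
            break
        context += retrieve(sentence)  # latest reasoning step as new query
    return chain
```

The design point is that retrieval quality compounds: a correct first hop rewrites the query distribution for the second hop, which one-shot RAG cannot do.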
FLARE (EMNLP 2023) improves on standard RAG by triggering retrieval mid-generation using token-probability confidence thresholds, reaching 51.0 EM on 2WikiMultihopQA versus 39.4 for single-retrieval — but calibration failures in instruction-tuned chat models limit its reliability for production finance agents.
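FLARE's trigger is token-level: tentatively generate the next sentence, and if any token's probability falls below a threshold, treat the sentence as unreliable, retrieve with it, and regenerate grounded in the retrieved documents. A minimal sketch of one such step, with hypothetical `generate_with_probs`, `retrieve`, and `regenerate` callables standing in for the real model and retriever (the 0.6 threshold is illustrative):

```python
def flare_step(generate_with_probs, retrieve, regenerate, query, threshold=0.6):
    """One FLARE-style generation step: retrieve only when the model's
    own token probabilities signal low confidence."""
    sentence, token_probs = generate_with_probs(query)
    if min(token_probs) < threshold:
        # Low-confidence token found: use the tentative sentence as the
        # retrieval query, then regenerate conditioned on the evidence.
        docs = retrieve(sentence)
        sentence = regenerate(query, docs)
    return sentence
```

This is also where the calibration caveat bites: if an instruction-tuned chat model is overconfident, `min(token_probs)` rarely drops below the threshold and retrieval is never triggered.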
MultiHiertt (ACL 2022) introduces 10,440 QA pairs from real financial reports averaging 3.89 hierarchical tables each; state-of-the-art models score 38% F1 versus 87% for humans, with a 15-point penalty for cross-table questions — quantifying the retrieval gap finance AI must close.
ConvFinQA (EMNLP 2022) extends FinQA into multi-turn conversation over S&P 500 earnings reports, finding that the best fine-tuned model achieves 68.9% execution accuracy versus 89.4% for human experts—and drops to 52.4% on hybrid multi-aspect conversations where models must carry numerical context across different financial topics.
TAT-QA is a 16,552-question benchmark over hybrid table-plus-text financial report contexts that showed evidence grounding, not arithmetic, to be the core bottleneck in finance AI; by 2024, fine-tuned 7B LLMs reached 83% F1, closing most of the gap against a 91% human ceiling.