FinRAGBench-V (EMNLP 2025) is the first large-scale benchmark for multimodal RAG with visual citations in finance, covering 112K+ document pages and 1,394 human-annotated QA pairs. Top models achieve only 20–61% block-level citation recall, and multimodal retrieval outperforms text-only by nearly 50 percentage points.
A systematic survey of LLM confidence estimation and calibration methods—white-box logit approaches, consistency-based SelfCheckGPT, and semantic entropy—reveals that verbalized confidence scores from GPT-4 achieve only ~62.7% AUROC, barely above chance, with direct implications for deploying uncertainty-aware agents in finance and accounting.
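The consistency-based idea behind SelfCheckGPT-style methods can be sketched in a few lines: sample the same prompt several times and treat agreement among the sampled answers as the confidence score. This is a minimal illustration, not code from the survey; the function name and sample data are hypothetical.

```python
from collections import Counter

def consistency_confidence(samples):
    """Estimate confidence as the agreement rate among repeated
    stochastic generations for the same prompt: the confidence in
    the modal answer is simply its frequency across the samples."""
    counts = Counter(s.strip().lower() for s in samples)
    answer, freq = counts.most_common(1)[0]
    return answer, freq / len(samples)

# Five hypothetical samples agree 3/5 of the time on "4.2%"
print(consistency_confidence(["4.2%", "4.2%", "4.3%", "4.2%", "4.1%"]))
# -> ('4.2%', 0.6)
```

Unlike verbalized confidence, this estimate costs extra generations but needs no logit access, which is why it remains attractive for closed-weight models.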
FinTrace benchmarks 13 LLMs on 800 expert-annotated financial task trajectories across 9 metrics, finding that frontier models achieve strong tool selection (F1 ~0.9) but score only 3.23/5 on information utilization — the step where agents reason over what tools return.
OmniEval (EMNLP 2025) benchmarks RAG systems across 5 task types × 16 financial topics using 11.4k auto-generated test cases. The best systems achieve only 36% numerical accuracy — concrete evidence that RAG pipelines need validation layers before writing to structured financial ledgers.
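As a sketch of what such a validation layer might look like (the function and account names are hypothetical, not from OmniEval): reject any model-proposed journal entry whose postings fail to balance, routing it to human review instead of writing it to the ledger.

```python
from decimal import Decimal

def validate_entry(postings, tolerance=Decimal("0.005")):
    """Guardrail between a RAG pipeline and a double-entry ledger:
    accept an extracted entry only if its postings sum to zero
    within a rounding tolerance. `postings` is a list of
    (account, amount) pairs with Decimal amounts."""
    total = sum((amt for _, amt in postings), Decimal("0"))
    return abs(total) <= tolerance

entry = [("Assets:Checking", Decimal("-120.00")),
         ("Expenses:Software", Decimal("120.00"))]
print(validate_entry(entry))  # -> True: debits and credits balance
```

Decimal arithmetic matters here: float rounding would make exact balance checks unreliable for currency amounts.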
FinDER benchmarks RAG on 5,703 real hedge fund analyst queries against S&P 500 10-K filings; E5-Mistral achieves only 25.95% context recall, and abbreviation-heavy queries cost 8.2 precision points — evidence that query normalization, not better embeddings, is the first fix for finance AI pipelines.
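A sketch of the query-normalization fix the result points at, using a hypothetical abbreviation table (a production system would curate one from filing glossaries or analyst input): expand shorthand before embedding so the query's surface form matches the 10-K's spelled-out language.

```python
import re

# Illustrative expansion table; entries here are examples, not FinDER's.
ABBREVIATIONS = {
    "yoy": "year over year",
    "capex": "capital expenditures",
    "fcf": "free cash flow",
    "sg&a": "selling, general and administrative expenses",
}

_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in ABBREVIATIONS) + r")\b",
    re.IGNORECASE,
)

def normalize_query(query):
    """Expand finance abbreviations in-place before retrieval."""
    return _PATTERN.sub(lambda m: ABBREVIATIONS[m.group(0).lower()], query)

print(normalize_query("AAPL capex vs FCF YoY"))
# -> AAPL capital expenditures vs free cash flow year over year
```

The point of the FinDER result is that this cheap preprocessing step recovers precision that swapping embedding models does not.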
The TACL 2024 paper by Liu et al. shows LLMs perform up to 20 points worse on information buried in the middle of long contexts — a U-shaped degradation affecting every tested model including Claude-1.3-100K — with concrete implications for how RAG pipelines should order retrieved passages in finance and accounting applications.
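One common mitigation the finding suggests is to reorder retrieved passages so the most relevant sit at the edges of the prompt, letting the least relevant fall into the weak middle. A minimal sketch (the interleaving scheme is a standard trick, not prescribed by the paper):

```python
def order_for_llm(passages_by_relevance):
    """Counter the U-shaped 'lost in the middle' effect: take passages
    ranked most-relevant-first and interleave them so ranks 1, 3, 5...
    fill the front and ranks 2, 4... fill the back, leaving the
    lowest-ranked passages in the middle of the context."""
    front, back = [], []
    for i, passage in enumerate(passages_by_relevance):
        (front if i % 2 == 0 else back).append(passage)
    return front + back[::-1]

# Relevance ranks 1..5 end up as [1, 3, 5, 4, 2]: best at both edges
print(order_for_llm([1, 2, 3, 4, 5]))
```

Because the reordering happens after retrieval scoring, it composes with any retriever without retraining anything.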
AnoLLM (ICLR 2025) reformulates tabular anomaly detection as LLM density estimation — fine-tuning on normal rows and scoring by negative log-likelihood. It outperforms classical methods on mixed-type fraud datasets but offers no edge on purely numerical data, with real implications for detecting anomalies in Beancount ledger entries.
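The scoring recipe can be sketched with a toy character-level unigram model standing in for the fine-tuned LLM; everything below is illustrative, but the shape matches the summary: serialize rows as text, fit a density model on normal rows, and flag rows with high average negative log-likelihood.

```python
import math
from collections import Counter

def train_unigram(normal_rows):
    """Fit a character unigram model on serialized 'normal' ledger
    rows: a toy stand-in for an LLM fine-tuned on normal data."""
    counts = Counter("".join(normal_rows))
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

def anomaly_score(model, row, floor=1e-6):
    """Score a row by its average negative log-likelihood under the
    model; rows unlike the training data receive higher scores."""
    return -sum(math.log(model.get(c, floor)) for c in row) / len(row)

normal = ["2024-01-02 Expenses:Rent 1200.00 USD",
          "2024-02-02 Expenses:Rent 1200.00 USD"]
model = train_unigram(normal)
usual = anomaly_score(model, "2024-03-02 Expenses:Rent 1200.00 USD")
weird = anomaly_score(model, "2024-03-02 Liabilities:??? 9.9e9 XYZ")
print(weird > usual)  # -> True: the off-pattern row scores higher
```

Swapping the unigram for an LLM keeps the interface identical; only the density estimator changes, which is what makes the AnoLLM framing attractive for mixed-type ledger data.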
DocFinQA replaces FinQA's curated 700-word passages with full 123,000-word SEC filings, exposing a 175× context increase that nearly halves GPT-4 accuracy on long documents. Retrieval pipelines fail to surface the right chunk 45% of the time even by hit rate at 3 (HR@3) — and long-context models are not a substitute.
TheAgentCompany tests 175 real workplace tasks across a simulated intranet with GitLab, OwnCloud, and RocketChat. The best model (Gemini-2.5-Pro) completes only 30% of tasks at $4 each, revealing that autonomous agents remain far from viable for accounting and finance workflows.
InvestorBench (ACL 2025) tests 13 LLM backbones on backtested stock, crypto, and ETF trading using cumulative return and Sharpe ratio — not QA accuracy. Qwen2.5-72B tops the stock leaderboard at 46.15% CR; finance-tuned models backfire on equities. Model size predicts performance more reliably than domain fine-tuning.