FinRAGBench-V (EMNLP 2025) is the first large-scale benchmark for multimodal RAG with visual citations in finance, covering 112K+ document pages and 1,394 human-annotated QA pairs. Top models achieve only 20–61% block-level citation recall, and multimodal retrieval outperforms text-only by nearly 50 percentage points.
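Block-level citation recall, as reported above, reduces to a set overlap between the blocks a model cites and the human-annotated gold evidence blocks. A minimal sketch (the block ids are made up for illustration; FinRAGBench-V's exact scoring may differ in detail):

```python
def citation_recall(predicted_blocks, gold_blocks):
    """Fraction of gold evidence blocks that appear among the model's citations."""
    gold = set(gold_blocks)
    if not gold:
        return 1.0  # nothing to cite; trivially perfect
    return len(gold & set(predicted_blocks)) / len(gold)

# Model cites one of two gold blocks on a page -> recall 0.5
print(citation_recall(["p3_table1"], ["p3_table1", "p4_chart2"]))  # 0.5
```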
Fin-RATE benchmarks 17 LLMs on 7,500 expert-curated QA pairs from 2,472 SEC filings, revealing an 18.60% accuracy collapse under longitudinal tracking and a 54-point drop for finance-specialized Fin-R1 on cross-entity tasks — with the retrieval pipeline, not the backbone model, as the binding bottleneck.
FinDER benchmarks RAG on 5,703 real hedge fund analyst queries against S&P 500 10-K filings; E5-Mistral achieves only 25.95% context recall, and abbreviation-heavy queries cost 8.2 precision points — evidence that query normalization, not better embeddings, is the first fix for finance AI pipelines.
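The query-normalization fix suggested above can be as simple as expanding finance abbreviations before embedding. A hedged sketch, assuming a hand-built abbreviation map (the map and queries are illustrative, not from FinDER):

```python
# Illustrative abbreviation map; a production system would need a much
# larger, curated dictionary.
ABBREV = {
    "yoy": "year over year",
    "eps": "earnings per share",
    "capex": "capital expenditures",
    "fcf": "free cash flow",
}

def normalize_query(query: str) -> str:
    """Expand known abbreviations token by token before retrieval."""
    tokens = query.lower().split()
    return " ".join(ABBREV.get(tok, tok) for tok in tokens)

print(normalize_query("FCF YoY growth"))  # free cash flow year over year growth
```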
DocFinQA replaces FinQA's curated 700-word passages with full 123,000-word SEC filings, exposing a 175× context increase that nearly halves GPT-4 accuracy on long documents. Retrieval pipelines fail to surface the right chunk 45% of the time at HR@3 — and long-context models are not a substitute.
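HR@3 (hit rate at 3) scores a retrieval as a hit only if the gold chunk lands in the top three results, so the 45% failure figure corresponds to a mean hit rate near 0.55. A minimal sketch of the metric (chunk ids are illustrative):

```python
def hit_rate_at_k(retrieved_ids, gold_id, k=3):
    """1 if the gold chunk appears in the top-k retrieved ids, else 0."""
    return int(gold_id in retrieved_ids[:k])

def mean_hit_rate(results, k=3):
    """results: list of (ranked retrieved ids, gold id) pairs."""
    return sum(hit_rate_at_k(r, g, k) for r, g in results) / len(results)

# One hit, one miss across two queries -> HR@3 of 0.5
print(mean_hit_rate([(["c2", "c7", "c1"], "c1"), (["c4", "c5", "c6"], "c9")]))
```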
FinAuditing tests 13 LLMs zero-shot on 1,102 real SEC XBRL filing instances; top scores are 13.86% on financial math verification and 12.42% on concept retrieval—results that directly bound what AI accounting tools can be trusted to automate without external tooling.
TAT-LLM fine-tunes LLaMA 2 7B with LoRA on financial table-text QA benchmarks and reaches 64.60% EM on FinQA, beating GPT-4's 63.91%, by decomposing reasoning into deterministic Extract-Reason-Execute steps that eliminate arithmetic errors.
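The Extract-Reason-Execute idea is that the LLM only selects values and writes a symbolic equation, while the arithmetic runs outside the model. A toy sketch under that assumption (function names, the table, and the hard-coded extraction are all illustrative, not TAT-LLM's actual interface):

```python
def extract(table, question):
    # In TAT-LLM the fine-tuned LLM performs extraction; hard-coded here.
    return {"rev_2021": table["revenue"][2021], "rev_2020": table["revenue"][2020]}

def reason(values):
    # The LLM emits a symbolic equation over the extracted values.
    return "(rev_2021 - rev_2020) / rev_2020"

def execute(equation, values):
    # Deterministic evaluation outside the model: no builtins, only the
    # extracted values are visible, so the final number cannot carry a
    # model calculation error.
    return eval(equation, {"__builtins__": {}}, dict(values))

table = {"revenue": {2020: 100.0, 2021: 125.0}}
vals = extract(table, "What was revenue growth in 2021?")
print(execute(reason(vals), vals))  # 0.25
```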
MultiHiertt (ACL 2022) introduces 10,440 QA pairs from real financial reports averaging 3.89 hierarchical tables each; state-of-the-art models score 38% F1 versus 87% for humans, with a 15-point penalty for cross-table questions — quantifying the retrieval gap finance AI must close.
ConvFinQA (EMNLP 2022) extends FinQA into multi-turn conversation over S&P 500 earnings reports, finding that the best fine-tuned model achieves 68.9% execution accuracy versus 89.4% for human experts—and drops to 52.4% on hybrid multi-aspect conversations where models must carry numerical context across different financial topics.
TAT-QA is a 16,552-question benchmark over hybrid table-plus-text financial report contexts that showed evidence grounding — not arithmetic — is the core bottleneck in finance AI; by 2024, fine-tuned 7B LLMs reached 83% F1, closing most of the gap against a 91% human ceiling.
FinQA (EMNLP 2021) built 8,281 QA pairs from S&P 500 earnings reports requiring multi-step arithmetic programs. Neural models scored 61% at release versus 91% for human experts; accuracy collapses to 22% on three-or-more-step programs. The failure modes — domain constants, cross-modality grounding, chain length — map directly to the challenges Beancount agents face today.
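FinQA answers are expressed as flat programs whose steps can reference earlier results with #n back-references. A simplified interpreter in that spirit (only four ops; FinQA's full DSL also includes comparison and table operations):

```python
import operator

OPS = {"add": operator.add, "subtract": operator.sub,
       "multiply": operator.mul, "divide": operator.truediv}

def run_program(steps):
    """Execute a flat program: a list of (op, arg1, arg2) steps where an
    argument of the form '#i' refers to the result of step i."""
    results = []

    def resolve(arg):
        if isinstance(arg, str) and arg.startswith("#"):
            return results[int(arg[1:])]
        return float(arg)

    for op, a, b in steps:
        results.append(OPS[op](resolve(a), resolve(b)))
    return results[-1]

# Two-step percentage-change program: (5829 - 5735) / 5735
print(run_program([("subtract", "5829", "5735"), ("divide", "#0", "5735")]))
```

The 22% accuracy on three-plus-step programs cited above is a failure of generating such programs, not of executing them; execution itself is deterministic.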