FinQA: The Benchmark Measuring AI Numerical Reasoning on Financial Reports
FinanceBench showed last week that retrieval is not the only hard part of finance QA; numerical reasoning is just as unsolved. FinQA, published at EMNLP 2021, is the paper that established why. I read it now because it is the foundational benchmark for financial arithmetic; every subsequent work in this space either extends it or benchmarks against it, and understanding where its models fail explains where current Beancount agents will fail too.
The paper
Zhiyu Chen, Wenhu Chen, and colleagues from UC Santa Barbara, J.P. Morgan, and Amazon introduced FinQA: A Dataset of Numerical Reasoning over Financial Data (arXiv:2109.00122, EMNLP 2021). The core task: given an earnings report containing both a prose narrative and one or more financial tables, answer a question that requires multi-step arithmetic over facts drawn from both modalities. The answer must be derived via an explicit numerical program — a sequence of up to five operations (addition, subtraction, multiplication, division, comparison, table aggregation, and a handful of others) applied to extracted values.
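The program format is easy to make concrete with a toy executor. The operation names below mirror the paper's DSL, but the implementation and the example figures are my own sketch, not the authors' code or data:

```python
# Minimal sketch of a FinQA-style program: a linear list of steps whose
# arguments are either literal numbers extracted from the document or
# references ("#k") to the result of an earlier step. Operation names
# follow the paper's DSL; the executor itself is illustrative.

OPS = {
    "add": lambda a, b: a + b,
    "subtract": lambda a, b: a - b,
    "multiply": lambda a, b: a * b,
    "divide": lambda a, b: a / b,
}

def execute(program):
    results = []
    for op, a, b in program:
        args = [results[int(x[1:])] if isinstance(x, str) and x.startswith("#") else x
                for x in (a, b)]
        results.append(OPS[op](*args))
    return results[-1]

# "What was the percentage change in revenue?" (invented numbers):
program = [
    ("subtract", 206588, 181001),  # step #0: absolute change
    ("divide", "#0", 181001),      # step #1: relative change
]
print(f"{execute(program):.2%}")   # about 14.14%
```

The "#k" references are what make programs auditable: every intermediate value is named and can be checked against the gold supporting facts.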
Eleven US-based finance professionals (CPAs, MBAs) built the dataset by hand from 2,789 pages of S&P 500 earnings reports spanning 1999–2019. The final dataset contains 8,281 annotated QA pairs, each with gold supporting facts and the full reasoning program, making it fully executable and auditable.
Key ideas
- The gap is brutal at release time. FinQANet (RoBERTa-large), the best neural model the authors could field, reached 61.24% execution accuracy and 58.86% program accuracy on the test set. Human financial experts scored 91.16% and 87.49%. Non-expert crowd workers hit only 50.68%, roughly ten points below the neural baseline, which tells you the domain requires real expertise, not just reading comprehension.
- Multi-step is where everything breaks. For programs requiring three or more reasoning steps, FinQANet accuracy collapses to 22.78%. The model handles two-step arithmetic reasonably well; anything longer and errors compound.
- Cross-modality questions are the hard case. Questions whose evidence spans both the table and the prose see 43.80% accuracy, roughly 17 points below the overall average. Grounding a number from a table cell to a qualifier in the surrounding prose is not something standard pre-trained models do well.
- Domain constants are a silent killer. When a program step requires a constant that is financial convention (e.g., that there are 1,000 thousands in a million, or that a basis point is 0.01%) rather than something stated in the document, accuracy drops to 43.88%. The model cannot reliably distinguish "this number is in the document" from "this number is world knowledge."
- ~50% of errors trace to domain knowledge gaps, not retrieval failures or arithmetic execution errors. The model found the right facts but applied wrong financial logic.
- Later LLMs close the gap substantially but do not eliminate it. GPT-4 is reported at roughly 76% execution accuracy on FinQA, and task-specific SOTA systems reached around 89% by 2024 — still below human expert performance.
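The paper scores models on two metrics, execution accuracy and program accuracy, and the distinction matters. Here is a minimal sketch of how I read it; the helper names and the exact-match simplification are mine (the paper additionally credits mathematically equivalent programs):

```python
def run(program):
    # Tiny evaluator for (op, a, b) steps; "#k" references step k's result.
    ops = {"add": lambda x, y: x + y, "subtract": lambda x, y: x - y,
           "multiply": lambda x, y: x * y, "divide": lambda x, y: x / y}
    res = []
    for op, a, b in program:
        x = res[int(a[1:])] if isinstance(a, str) else a
        y = res[int(b[1:])] if isinstance(b, str) else b
        res.append(ops[op](x, y))
    return res[-1]

def exec_acc(pred, gold, tol=1e-4):
    # Execution accuracy: do the executed answers match (within tolerance)?
    return abs(run(pred) - run(gold)) <= tol

def prog_acc(pred, gold):
    # Program accuracy, simplified here to exact step match.
    return pred == gold

# Two ways to compute "change in revenue as a fraction of the prior year":
gold = [("subtract", 206588, 181001), ("divide", "#0", 181001)]
pred = [("divide", 206588, 181001), ("subtract", "#0", 1)]

assert exec_acc(pred, gold)      # same answer: execution accuracy credits it
assert not prog_acc(pred, gold)  # different steps: naive program match fails
```

A model can get the right answer via an unintended route, which is why execution accuracy (61.24%) sits above program accuracy (58.86%) for FinQANet.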
What holds up — and what doesn't
The benchmark design is sound. Using executable programs instead of free-text answers is the right call: you can score a model unambiguously, and you get a window into how it reasoned, not just whether it was right. The decision to require both table and text evidence reflects real-world financial analysis, where the table gives you the number and the footnote tells you what the number means.
That said, the task is narrower than it looks. The predefined DSL of operations covers standard financial arithmetic, but it cannot represent a categorization decision ("is this expense recurring or one-time?"), a policy check ("does this cash flow comply with our budget policy?"), or anything requiring external retrieval of market data or accounting standards. The programs are correct and explainable, but they live in a world where the only uncertainty is arithmetic, not judgment.
The retrieval setup also gives the model gold supporting facts during training, which flatters the numbers. In a real deployment you would have to retrieve the right table cells from a long document before you could execute the program — and that retrieval step is not trivial, as FinanceBench showed last week.
Finally, the 2021 results understate current model capability. The ~61% baseline was pre-ChatGPT. The ~76% GPT-4 number and ~89% SOTA numbers come from specialized pipelines that combine chain-of-thought, code execution, and fine-tuning. The gap to human expert (91%+) has narrowed but persists.
Why this matters for finance AI
Beancount ledgers are essentially stripped-down earnings reports: structured rows of debits and credits with prose metadata in transaction notes, payee fields, and account hierarchies. Every skill the FinQA benchmark tests maps directly to something a Beancount agent must do.
The cross-modality failure mode is particularly important. In a Beancount context, an agent might see a transaction amount in the ledger, a foreign-currency rate in a price directive, and a comment in the note field — and need all three to compute the correct reporting-currency value. The models that FinQA tested in 2021 could not reliably cross-reference those sources. Current LLMs do better, but the 22.78% accuracy on 3+-step programs is a warning: chain length is a real failure axis, and multi-step ledger reconciliation tasks will hit it.
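To make that three-source chain concrete, here is a sketch with invented ledger data; the account name, amounts, FX rate, and the note parsing are all hypothetical, not drawn from any real ledger:

```python
# Hypothetical three-source reconciliation: the amount comes from the
# transaction line, the FX rate from a price directive, and a correction
# from the narration note. All values are invented for illustration.

posting = {"account": "Assets:EU:Checking", "amount": 1250.00, "currency": "EUR"}
prices = {("EUR", "USD"): 1.08}        # from a `price` directive
note = "includes 50.00 EUR refundable deposit"

deposit = 50.00                        # an agent would parse this from the note
net_eur = posting["amount"] - deposit  # exclude the deposit per the note
usd = net_eur * prices[(posting["currency"], "USD")]
print(f"{usd:.2f} USD")                # 1200.00 EUR * 1.08 = 1296.00 USD
```

Each hop is trivial in isolation; the failure mode FinQA documents is chaining them, where dropping the note or the price directive silently produces a plausible but wrong number.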
The domain constants problem also generalizes. Accounting has its own conventions — double-entry invariants, account type semantics, fiscal year boundaries — that a model must know without being told. FinQA's error analysis showing ~50% domain-knowledge failures suggests that a Beancount agent needs either fine-tuning on accounting conventions or an explicit retrieval layer for accounting rules, not just ledger entries.
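One such convention, the double-entry invariant, is simple to state as an executable check. The sketch below is mine, not Beancount's actual validation code, which also handles costs, prices, and per-currency tolerances:

```python
from collections import defaultdict

def check_balanced(postings, tol=1e-9):
    # Double-entry invariant: within a transaction, postings must sum
    # to zero per currency. Simplified illustration of the convention.
    totals = defaultdict(float)
    for account, amount, currency in postings:
        totals[currency] += amount
    return all(abs(v) <= tol for v in totals.values())

txn = [("Expenses:Groceries",  42.17, "USD"),
       ("Assets:Checking",    -42.17, "USD")]
print(check_balanced(txn))  # True
```

The point is that nothing in the ledger text states this rule; like FinQA's domain constants, the model has to bring it as prior knowledge or retrieve it.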
The benchmark's program representation, though constrained, also points toward how Beancount agents should express their reasoning: not natural language that could be vague, but executable operations that can be checked, rolled back, or audited.
What to read next
- TAT-QA (arXiv:2105.07624, ACL 2021) — extends the hybrid table+text setting to 16,552 questions with a richer variety of reasoning types; the TAGOP model it introduces is worth studying for how it handles span extraction from both modalities jointly.
- ConvFinQA (arXiv:2210.03849, EMNLP 2022) — the conversational extension of FinQA, where each dialogue has cross-turn numerical dependencies; the multi-turn structure maps directly to an interactive Beancount assistant that must track running calculations across user follow-ups.
- MultiHiertt (arXiv:2206.01347, ACL 2022) — pushes the setting to financial reports with multiple hierarchical tables per document; a necessary step toward the consolidation statements and multi-year ledger views that Beancount agents will face.
