FinanceBench: Why Vector-Store RAG Fails on Real Financial Documents
FinanceBench arrives at a moment when every enterprise AI vendor claims their system can "answer questions from your financial documents." This paper from Patronus AI puts those claims to a hard test using real SEC filings and carefully curated open-book questions. The results are uncomfortable reading for anyone building finance AI.
The paper
Islam et al. introduce FinanceBench: A New Benchmark for Financial Question Answering (arXiv:2311.11944), a test suite of 10,231 questions about publicly traded companies drawn from real SEC filings — 10-K annual reports, 10-Q quarterly filings, 8-K current reports, and earnings transcripts. Unlike earlier finance QA datasets (FinQA, TAT-QA), which present pre-extracted tables and passages, FinanceBench requires a system to retrieve evidence from full documents before answering. That is the realistic setting. The questions are designed to be factually unambiguous and, in the authors' words, "a minimum performance standard."
The team evaluated 16 configurations spanning GPT-4-Turbo, Llama2, and Claude2 across four retrieval strategies: closed-book (no retrieval), shared vector store, per-document vector store, and long-context prompts feeding the full relevant page. Human annotators manually reviewed all 2,400 responses (16 configurations times the 150 open-source cases).
Key ideas
- Retrieval is not the bottleneck. GPT-4-Turbo given the oracle passage — the exact page containing the answer — still only reaches 85% accuracy. Long-context prompting (feeding the right page automatically) scores 79%. Perfect retrieval buys you six points.
- Vector-store RAG is the real problem. GPT-4-Turbo with a per-document vector store: 50% correct, 39% refused. With a shared vector store across companies: 19% correct, 68% refused. The headline "81% failure rate" comes from that shared-store setup — the configuration most enterprise demos actually use.
- Models fail differently. Llama2 hallucinates aggressively (54–70% incorrect); GPT-4-Turbo refuses (39–68% refused rather than wrong). Both failure modes are unacceptable in production, but they are not equivalent risks.
- 66% of questions require numerical reasoning. Growth rates, margins, year-over-year deltas. This is where models most commonly err — calculation mistakes, unit confusion, sign errors.
- Long context nearly rescues it. Claude2 long context: 76% correct. GPT-4-Turbo long context: 79%. These are the best practical numbers, achieved by skipping retrieval and feeding the whole relevant page directly.
- Even the oracle leaks. With perfect evidence, the ceiling is 85%, not 100%. That residual 15% consists of pure reasoning failures, with no retrieval component to blame.
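The shared-versus-per-document gap is easy to reproduce even with a toy retriever. Below is a minimal sketch, with bag-of-words similarity standing in for a real embedding model and illustrative chunk texts not taken from the paper's corpus: cross-company chunks about revenue growth look nearly identical to a retriever, and text inside a filing rarely repeats the company name, so a shared store can happily return the wrong company's filing.

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding" standing in for a dense encoder.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Illustrative chunks; note that text inside a filing rarely repeats
# the company name, so the query's "pepsico" token finds no anchor.
chunks = [
    {"company": "PepsiCo",   "text": "net revenue increased 8.7 percent in fiscal 2022"},
    {"company": "Coca-Cola", "text": "revenue growth of 11 percent in 2022"},
]

def retrieve(query, company=None):
    # company=None  -> shared store over every filing
    # company="..." -> per-document store via a metadata filter
    pool = [c for c in chunks if company is None or c["company"] == company]
    return max(pool, key=lambda c: cosine(embed(query), embed(c["text"])))

query = "pepsico revenue growth 2022"
shared = retrieve(query)                     # surfaces the wrong company
scoped = retrieve(query, company="PepsiCo")  # scoping forces the right filing
```

The cheap fix is exactly the metadata filter: scoping retrieval to the known document recovers the per-document numbers, which is why the shared-store 19% and the per-document 50% are really two different products.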
What holds up — and what doesn't
The benchmark's design is sound. Insisting on real documents over pre-extracted snippets is the correct methodological choice — it tests what actually matters in deployment. The manual evaluation of 2,400 responses is expensive and credible.
What I find less convincing is drawing rankings from n=150. The difference between Claude2 long context (76%) and GPT-4-Turbo long context (79%) is not statistically distinguishable at that sample size, yet the paper presents it as a ranking. The full 10,231-question benchmark exists but isn't publicly scored, which limits independent reproduction.
The oracle result is also the most important and least analyzed finding. If models fail 15% of the time with the correct page in hand, the problem is reasoning and arithmetic, not retrieval. The paper flags calculator tools and chain-of-thought as future work — those experiments should have been the center of this paper, not the footnote.
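The calculator-tool direction the paper defers is cheap to sketch: let the model extract the raw figures and push the arithmetic, where the sign and unit errors live, into deterministic code. The helper name and figures below are illustrative, not from the paper.

```python
def yoy_growth(current, prior):
    """Year-over-year growth computed deterministically.

    The model's only job is to extract `current` and `prior`; sign
    handling (abs on the base, so growth from a negative base reads
    correctly) and percent formatting never touch the LLM.
    """
    if prior == 0:
        raise ValueError("prior-year figure is zero; growth is undefined")
    return (current - prior) / abs(prior)

# Illustrative 10-K revenue figures in $M (not from the paper).
growth = yoy_growth(current=86_392, prior=79_474)
print(f"{growth:+.1%}")  # +8.7%
```

Nothing here is novel, which is rather the point: the 15% oracle-failure band is dominated by errors a ten-line function eliminates.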
The benchmark also acknowledges it targets "minimum performance": single-document questions with unambiguous answers. Cross-document reasoning, multi-year trends, and inter-company comparisons are excluded. Papers citing the 79% long-context number will rarely carry that caveat.
Why this matters for finance AI
The Beancount write-back use case maps almost directly onto FinanceBench's failure modes. An agent that retrieves a transaction entry and checks whether the amount matches a bank statement is doing the same retrieval-then-arithmetic task this benchmark measures. The oracle ceiling — 85% even with perfect context — is the relevant design constraint: even if the agent finds the right ledger entry, there is a real probability it will miscalculate the comparison, confuse the sign, or misread the units.
The Llama2/GPT-4 failure split matters for agent architecture. A refusal is recoverable (route to human review); a hallucinated match committed to the ledger is not. This argues for preferring conservative refusal behavior over confident hallucination, even at the cost of a lower apparent success rate.
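That preference can be encoded as a hard routing rule. A minimal sketch follows, with hypothetical names and a hypothetical tolerance: only a deterministic amount match may write to the ledger, and everything else, including a retrieval refusal, degrades to human review rather than a guess.

```python
from decimal import Decimal

TOLERANCE = Decimal("0.01")  # hypothetical one-cent tolerance

def route(ledger_amount, statement_amount):
    # ledger_amount is None when retrieval refused or found no entry:
    # recoverable, so send it to a human instead of guessing a match.
    if ledger_amount is None:
        return "review:not-found"
    # Only a deterministic comparison within tolerance may auto-commit;
    # a hallucinated "looks close enough" never can.
    if abs(ledger_amount - statement_amount) <= TOLERANCE:
        return "commit"
    return "review:mismatch"

print(route(Decimal("42.00"), Decimal("42.00")))  # commit
print(route(Decimal("42.00"), Decimal("41.10")))  # review:mismatch
print(route(None, Decimal("42.00")))              # review:not-found
```

The design choice is that the model never emits "commit" directly; it can only supply candidates, and the write path is gated by exact arithmetic.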
The long-context advantage (79% vs. 50%) is practically frustrating for ledger applications. Multi-year Beancount files are too large to feed in full. Solving retrieval over dense numerical documents — not just text retrieval — remains an open problem.
What to read next
- FinQA: A Dataset of Numerical Reasoning over Financial Data (Chen et al., EMNLP 2021, arXiv:2109.00122) — the precursor benchmark FinanceBench explicitly improves upon; useful for understanding what the field got right before real-document retrieval was required.
- DocFinQA: A Long-Context Financial Reasoning Dataset (Reddy et al., ACL 2024) — extends FinQA's questions to full-filing context, so systems must locate evidence inside an entire SEC document rather than a pre-extracted passage.
- PAL: Program-Aided Language Models (Gao et al., arXiv:2211.10435, ICML 2023) — offloads arithmetic to a Python interpreter, directly addressing the 66% of FinanceBench questions that fail on numerical reasoning.
