
DocFinQA: Long-Context Financial Reasoning on Full SEC Filings

· 5 min read
Mike Thrift
Marketing Manager

DocFinQA is a 2024 ACL paper that takes the existing FinQA dataset and re-presents each question alongside the complete SEC filing it came from — expanding average context from under 700 words to 123,000 words. I'm reading it because it directly tests the scenario every production Beancount agent faces: not a tidy extracted passage, but the whole messy document. The results are sobering for anyone planning to deploy long-context models over multi-year ledgers.

The paper

DocFinQA: A Long-Context Financial Reasoning Dataset — Varshini Reddy, Rik Koncel-Kedziorski, Viet Dac Lai, Michael Krumdick, Charles Lovering, and Chris Tanner (ACL 2024, Short Papers) — takes the 8,281 QA pairs from FinQA and augments 7,621 of them with the full SEC annual report each question originally came from. The result is 1,236 unique filings split across 5,798 training, 791 dev, and 1,032 test examples, with average context ballooning 175× from roughly 700 words to 123,453 words.


The question set is unchanged — these are the same multi-step numerical reasoning questions requiring Python programs to answer. What changes is that the model now receives the full filing rather than an expertly curated 700-word passage. The research compares two families of approach: classic retrieval pipelines (chunk, rank, answer) and emerging long-context LLMs that attempt to process the full document end-to-end.
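The chunk-rank-answer pipeline can be sketched in a few lines. The chunk size and overlap below mirror the paper's settings; the token-overlap scorer, however, is a deliberately naive stand-in for the ColBERT ranker the paper actually evaluates.

```python
def chunk(text: str, size: int = 2750, overlap: float = 0.2) -> list[str]:
    """Split a filing into fixed-size character windows with fractional overlap."""
    step = int(size * (1 - overlap))  # 2,200-character stride -> 20% overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

def rank(question: str, chunks: list[str], k: int = 3) -> list[str]:
    """Rank chunks by naive token overlap with the question; return the top k."""
    q_tokens = set(question.lower().split())
    return sorted(chunks,
                  key=lambda c: len(q_tokens & set(c.lower().split())),
                  reverse=True)[:k]
```

The answer step then hands the top-k chunks to the LLM; everything interesting in the paper's results hinges on whether the gold chunk survives this ranking.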

Key ideas

  • Best retrieval pipeline accuracy on the test set: GPT-3.5 at 42.64%. Open-source models lag well behind: Mistral-7B at 24.97%, CodeLlama-13B at 21.01%, MPT-30B at 18.07%.
  • The best retrieval encoder — a fine-tuned ColBERT — achieves HR@1 = 0.35 and HR@3 = 0.55, meaning the correct chunk is absent from the model's context nearly half the time even when retrieving three passages.
  • Long-context GPT-4 (evaluated on a 400-question subsample): 46.5% on shorter documents (≤100K tokens) versus 23.0% with a Summarize-then-Answer strategy on the longest documents (>100K tokens). GPT-4 makes nearly twice as many errors on long documents as on short ones.
  • Finance-specific PDF parsing (Kensho Extract) substantially outperformed generic HTML parsing (BeautifulSoup), particularly for table preservation — a practical finding for any pipeline built on SEC filings.
  • A substantial fraction of relevant chunks lie beyond chunk position 250 in the document, meaning truncation-based strategies silently discard the right evidence before the model ever sees it.
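Hit rate at k, the metric behind those 0.35 and 0.55 figures, is simply the fraction of questions whose gold chunk appears among the top-k retrieved passages. A minimal sketch, with placeholder chunk identifiers:

```python
def hit_rate_at_k(retrieved: list[list[str]], gold: list[str], k: int) -> float:
    """HR@k: fraction of questions whose gold chunk is in the top-k retrieval."""
    hits = sum(g in r[:k] for r, g in zip(retrieved, gold))
    return hits / len(gold)
```

At HR@3 = 0.55, even retrieving three passages per question leaves the gold chunk missing 45% of the time.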

What holds up — and what doesn't

The core empirical contribution is solid: the dataset is a faithful extension of FinQA with well-defined methodology (four-gram similarity scoring to identify golden chunks, 2,750-character chunks with 20% overlap), and the finding that performance degrades severely with document length is consistent across both retrieval and long-context approaches. The near-doubling of GPT-4 errors on long documents versus short is striking and hard to explain away.
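The golden-chunk identification can be sketched roughly as follows: score each chunk by how many word four-grams it shares with the original curated FinQA passage, and take the best match. This is a simplification; the paper's exact tokenization and scoring details are assumed here.

```python
def four_grams(text: str) -> set[tuple[str, ...]]:
    """All consecutive four-word sequences in the text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + 4]) for i in range(len(tokens) - 3)}

def golden_chunk(finqa_passage: str, chunks: list[str]) -> str:
    """Pick the chunk sharing the most four-grams with the curated passage."""
    gold = four_grams(finqa_passage)
    return max(chunks, key=lambda c: len(gold & four_grams(c)))
```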

What the paper doesn't fully address is the frontier of 2024-vintage long-context models. The long-context evaluation covers only 400 samples, limited by cost, and does not test Gemini 1.5 Pro (1M-token window) or Claude 3 (200K). The chunking hyperparameters are reasonable but not systematically ablated, and the Summarize-then-Answer multi-call strategy is probably not the best available — IRCoT's interleaved retrieval and StructRAG's structured synthesis both suggest better approaches exist for multi-hop evidence aggregation in long documents.

The fine-tuned ColBERT topping out at HR@3 = 0.55 reveals the deeper problem: retrieval over long financial documents is itself unsolved. Even with a perfect generative model, nearly half of queries would receive an answer built from the wrong passages. The paper surfaces this as the binding constraint but stops short of quantifying how much accuracy recovers under oracle retrieval.

Why this matters for finance AI

Multi-year Beancount ledgers don't average 123K words by default, but a decade of transactions with detailed memos easily reaches that scale, and a finance agent operating over full annual reports faces exactly this regime. The jump from "we cherry-picked the right 700 words" (FinQA) to "here is the full annual report" (DocFinQA) is the gap between a toy benchmark and production reality. DocFinQA makes that gap measurable.
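A back-of-envelope check of that claim, where every posting volume and word count is a loose assumption for illustration rather than a measurement:

```python
# Rough estimate: does a decade-long Beancount ledger reach DocFinQA scale?
years = 10
txns_per_year = 1_500          # assumed posting volume for an active ledger
words_per_txn = 9              # date, payee, accounts, amounts (assumed)
memo_words_per_txn = 6         # narration / memo detail (assumed)

total_words = years * txns_per_year * (words_per_txn + memo_words_per_txn)
print(total_words)  # 225000 -- past the 123K-word DocFinQA average
```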

The near-50% drop in GPT-4 accuracy from short to long documents argues against a simple "just use a bigger context window" response. Retrieval remains necessary but is only 55% reliable at HR@3. For a Beancount write-back agent needing to locate a depreciation schedule buried in a year-old note-to-accounts, neither architecture gives you the reliability you'd want before committing a journal entry. The honest reading of this paper: better retrieval, better evidence aggregation, and explicit evaluation of silent failures — not a larger context window — are what the field actually needs.

Further reading

  • "Lost in the Middle: How Language Models Use Long Contexts" — Liu et al., 2023, arXiv:2307.03172. Provides the mechanistic explanation for the positional accuracy collapse DocFinQA measures, with the now-canonical U-shaped performance curve.
  • FinDER: Financial Dataset for Question Answering and Evaluating Retrieval-Augmented Generation — arXiv:2504.15800, ICLR 2025 Workshop. A 2025 successor benchmark with 5,703 query-evidence-answer triplets designed around realistic professional financial search queries, including abbreviations and acronyms that standard retrievers miss.
  • Fin-RATE: A Real-world Financial Analytics and Tracking Evaluation Benchmark for LLMs on SEC Filings — arXiv:2602.07294. A newer SEC-filing benchmark that adds temporal tracking tasks beyond single-document QA, closer to what a Beancount audit agent would actually need.