FinDER: Real Analyst Queries Expose a 74% Recall Gap in Financial RAG
FinDER (arXiv:2504.15800) is a retrieval benchmark built around a simple but underappreciated observation: the queries real financial professionals type look nothing like the polished questions in academic benchmarks. I'm reading it because it sits at the intersection of two threads I've been tracking — the retrieval gap in finance AI, and the practical realism problem that DocFinQA and FinanceBench started to expose.
The paper
Chanyeol Choi, Jihoon Kwon, and colleagues at a financial AI firm present a dataset of 5,703 expert-annotated query–evidence–answer triplets sourced from a real hedge fund analyst Q&A service. The documents are Form 10-K filings from 490 S&P 500 companies, collected from SEC EDGAR. What distinguishes FinDER from prior benchmarks is the query side: 89.86% of queries contain three or more domain-specific abbreviations or acronyms. Instead of "What is the total revenue of Company X for fiscal year 2023?", a real analyst might type "GOOGL 10-K FY23 revs breakdown by segment." The dataset was published at the ICLR 2025 Workshop on Advances in Financial AI and later appeared at ICAIF 2025.
Key ideas
- Retrieval recall is shockingly low across the board: E5-Mistral (best dense retriever) achieves only 25.95% context recall overall; BM25 manages 11.68%. The "Financials" category — the one most directly relevant to accounting — is the hardest: 15.84% and 6.42% respectively.
- Query ambiguity alone costs 8.2 precision points: On a 500-query subset, the authors compare E5-Mistral on well-formed paraphrases (33.9 precision) against the real abbreviated queries (25.7). Because the underlying documents are held fixed, the gap is attributable to abbreviation and acronym handling, not document complexity.
- Retrieval quality is the dominant bottleneck for generation: LLMs with no context score near-zero (9–10% correct); with top-10 retrieved passages they reach 29–34%; with perfect oracle context they jump to 60–68%. That roughly 30-point gap between realistic and oracle conditions is bigger than the gap between open-source and frontier models.
- Compositional arithmetic breaks even with good retrieval: Multi-step calculation tasks (compositional queries) reach only ~20% correctness across all four models — Claude-3.7-Sonnet, GPT-o1, DeepSeek-R1-Distill, and Qwen-QWQ — even with the top-10 retrieved passages. GPT-o1 leads multiplication tasks at 42.90% but falls to 27.78% on division.
- LLM reranking adds modest but consistent improvement: Letting models rerank the top-10 E5-Mistral hits before answering, Claude-3.7-Sonnet achieves F1 of 63.05 and GPT-o1 reaches 62.90. DeepSeek-R1-Distill trails at 60.01, despite strong performance on structured reasoning elsewhere.
- Category difficulty is uneven: Risk queries are easiest to retrieve (E5-Mistral: 33.07% recall); Financials remain hardest (15.84%). This correlates with query structure: risk disclosures use natural-language prose, while financial tables use dense numeric notation.
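The paper does not publish its metric implementation, but per-query context recall as standardly defined — the fraction of a query's gold evidence passages that appear in the top-k retrieved results — is a one-liner. Function name and toy passage IDs below are mine:

```python
def context_recall(retrieved_ids, gold_ids, k=10):
    """Fraction of a query's gold evidence passages found in the top-k results."""
    if not gold_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(gold_ids)) / len(set(gold_ids))

# Toy case: one of three gold passages surfaces in the top 10.
print(context_recall(["p4", "p9", "p1"], ["p1", "p2", "p3"]))  # 0.333...
```

The headline numbers (25.95% for E5-Mistral, 11.68% for BM25) are this quantity averaged over the query set.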
What holds up — and what doesn't
The core contribution is solid: this is a real query distribution from working analysts, and the abbreviation problem is genuine. Any benchmark built from Wikipedia or FinQA-style crowdsourcing misses this. The three-tier evaluation structure — no context, realistic retrieval, oracle context — is the right design; it cleanly separates retrieval quality from reasoning quality and shows the residual generation gap (still ~32–34% failure even with perfect context on qualitative questions).
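The three-tier design is worth internalizing; a minimal harness sketch makes the separation explicit. All names here are hypothetical — the paper does not release evaluation code — and `model`/`retriever` stand in for whatever LLM and retrieval calls you plug in:

```python
from dataclasses import dataclass

@dataclass
class Example:
    query: str
    gold_evidence: list
    answer: str

def three_tier_eval(model, dataset, retriever, k=10):
    """Score one model under no-context, realistic-retrieval, and oracle conditions."""
    conditions = {
        "no_context": lambda ex: [],                       # lower bound: parametric knowledge only
        "retrieved":  lambda ex: retriever(ex.query)[:k],  # realistic RAG pipeline
        "oracle":     lambda ex: ex.gold_evidence,         # upper bound: perfect retrieval
    }
    return {
        name: sum(model(ex.query, ctx(ex)) == ex.answer for ex in dataset) / len(dataset)
        for name, ctx in conditions.items()
    }
```

The spread between `retrieved` and `oracle` isolates retrieval loss; the spread between `oracle` and 100% isolates the residual generation gap.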
Where the paper is weakest is reproducibility. At the time of publication, the dataset was not publicly available — the authors state they "plan to release it publicly at a later time." This is a significant problem for a workshop paper presenting itself as an evaluation standard. Benchmarks that aren't released are not benchmarks; they're case studies. It has since appeared at ICAIF 2025, so release may have followed, but the arXiv version does not confirm this.
The retrieval evaluation also covers only four single-stage retrievers (BM25 plus three dense embedders: GTE, mE5, E5-Mistral). There is no hybrid retrieval, no query expansion, no HyDE, no rewriting step targeting the abbreviation problem specifically. Given that the authors have precisely characterized the abbreviation gap, it's surprising they don't test the obvious fix: expand the query ("GOOGL" → "Alphabet Inc.") before retrieval. That experiment is absent.
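The missing experiment is cheap to prototype: a dictionary pass that expands tickers and shorthand before the query reaches the embedder. The expansion table below is a hypothetical toy; a real system would build it from exchange listings plus a finance-abbreviation glossary:

```python
import re

# Hypothetical expansion table; real coverage would come from exchange
# listings and a domain glossary, not a hand-written dict.
EXPANSIONS = {
    "GOOGL": "Alphabet Inc.",
    "10-K": "Form 10-K annual report",
    "FY23": "fiscal year 2023",
    "revs": "revenue",
}

def expand_query(query: str) -> str:
    """Replace known abbreviations token-by-token before embedding."""
    tokens = re.findall(r"[\w-]+|\S", query)
    return " ".join(EXPANSIONS.get(tok, tok) for tok in tokens)

print(expand_query("GOOGL 10-K FY23 revs breakdown by segment"))
# Alphabet Inc. Form 10-K annual report fiscal year 2023 revenue breakdown by segment
```

Whether this closes the 8.2-point gap is exactly the experiment the paper leaves on the table.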
The generation results deserve a closer read. The ~9–10% no-context performance is not a useful lower bound — it's essentially zero — but the 60–68% oracle ceiling is more informative than it appears. Even with the correct passage in hand, the best models fail on roughly one-third of qualitative questions and four-fifths of compositional arithmetic. That ceiling matters: it means retrieval alone cannot solve the problem.
Why this matters for finance AI
The query distribution in FinDER maps well onto how Beancount users actually interact with a ledger agent. A user who has been maintaining their accounts for years will type abbreviated, contextual queries — "AMZN card Q3 reimb?" rather than "What are the Amazon credit card reimbursements in Q3?" Standard embedding models will fail to retrieve the right entries because they were trained on clean natural-language text. The 8.2-point precision drop from clean to real queries is probably conservative for a personal ledger domain, where idiosyncratic shorthand ("prop mgmt fee" for "property management fee") is even further from training data than SEC-standard abbreviations.
The 25.95% context recall ceiling on E5-Mistral is a forcing function: any Beancount RAG pipeline needs to budget for a large fraction of missed evidence. One implication is that high-recall re-retrieval (multiple passes, diversified query formulations) matters more than pushing F1 on a single pass. Another is that query normalization — mapping user shorthand to canonical account names before retrieval — should be an explicit preprocessing step, not left to the embedding model.
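What that normalization step might look like for a ledger: an alias table mapping user shorthand to canonical account names, with a fuzzy fallback for near-misses. The aliases and Beancount account names here are invented for illustration:

```python
import difflib

# Hypothetical alias table: user shorthand -> canonical Beancount accounts.
ALIASES = {
    "amzn card": "Liabilities:CreditCard:Amazon",
    "reimb": "Income:Reimbursements",
    "prop mgmt fee": "Expenses:Home:PropertyManagement",
}

def normalize(term: str, cutoff: float = 0.6) -> str:
    """Map shorthand to a canonical account, falling back to fuzzy matching."""
    key = term.lower().strip()
    if key in ALIASES:
        return ALIASES[key]
    close = difflib.get_close_matches(key, list(ALIASES), n=1, cutoff=cutoff)
    return ALIASES[close[0]] if close else term

print(normalize("AMZN card"))       # Liabilities:CreditCard:Amazon
print(normalize("prop mgmt fees"))  # fuzzy-matches "prop mgmt fee"
```

Running this before the embedding model sees the query sidesteps the exact failure mode FinDER measures, rather than hoping the embedder generalizes to shorthand it never saw in training.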
The 20% compositional arithmetic accuracy even with oracle context is a separate signal: for Beancount calculation tasks, the generation bottleneck is reasoning, not retrieval. PAL-style offloading (generating Python arithmetic rather than free-text calculation) remains the right answer for numeric tasks regardless of how good retrieval gets.
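A minimal PAL-style loop, with a hand-written stand-in for the model's generated program (the revenue figures are invented for illustration):

```python
# Stand-in for model output: the LLM is prompted to emit its arithmetic as
# Python rather than computing in free text.
generated = """
segment_revenue = [31_300, 7_400, 8_100]  # figures lifted from retrieved context
total = sum(segment_revenue)
prior_year = 41_000
growth_pct = round((total - prior_year) / prior_year * 100, 2)
"""

scope = {}
# Stripped-down builtins so the generated code can only do arithmetic.
# (Not a real security boundary; production use needs an actual sandbox.)
exec(generated, {"__builtins__": {"sum": sum, "round": round}}, scope)
print(scope["growth_pct"])  # 14.15
```

The point is that the model's job shrinks to transcribing numbers and operations from context; the interpreter, not the sampler, does the division that tripped up GPT-o1.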
What to read next
- Fin-RATE (arXiv:2602.07294) — the companion benchmark for multi-period tracking on SEC filings; accuracy drops 18.60% on temporal tasks, which is the Beancount multi-year ledger problem stated directly.
- IRCoT (arXiv:2212.10509, ACL 2023) — interleaving retrieval with chain-of-thought reasoning; the multi-pass retrieval structure directly addresses the low single-pass recall FinDER exposes.
- Query expansion with LLMs for domain-specific retrieval — no single benchmark paper covers this well yet, but the FinDER abbreviation gap makes it a first-order research priority; searching for "HyDE financial domain" and "query expansion SEC filings 2025" is the right starting point.
