
Fusion-in-Decoder: How Multi-Passage Retrieval Improves Generative QA

· 5 min read
Mike Thrift
Marketing Manager

Retrieval-augmented generation lives or dies by how well the generator can synthesize evidence spread across multiple documents. Izacard and Grave's 2021 EACL paper, "Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering," proposes a deceptively simple architectural fix — encode passages independently, fuse them all in the decoder — that outperforms the then-dominant RAG framework by a significant margin. I'm reading it now because the design principle maps directly to ledger QA: before deciding how to retrieve entries in Beancount agents, it's worth understanding which fusion strategy actually works.

The paper


Lewis et al.'s original RAG (arXiv:2005.11401) marries a dense retriever with a BART generator but forces the generator to condition on a single retrieved passage at a time, marginalizing over passages either per-sequence (RAG-Sequence) or per-token (RAG-Token). Izacard and Grave identified this as the binding constraint: a model that can only see one passage at a time cannot easily triangulate across evidence that is scattered across documents.

Their FiD (Fusion-in-Decoder) solution is elegant. Each retrieved passage is concatenated with the question, then encoded independently by T5's encoder. The encoder runs once per passage — fully parallelizable. The decoder then performs cross-attention over the concatenation of all passage representations simultaneously. The encoder complexity scales linearly with the number of passages; the decoder, crucially, can attend across passage boundaries during every generation step. The paper uses T5-base and T5-large as the generator backbone.
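The encode-independently, fuse-in-the-decoder trick is mostly a matter of tensor reshaping. Here is a minimal shape-level sketch with generic PyTorch transformer layers standing in for T5's encoder and decoder (the sizes are illustrative, not the paper's):

```python
import torch
import torch.nn as nn

# Illustrative sizes: 2 questions, 4 retrieved passages each, 32 tokens per
# (question + passage) pair, hidden width 64. Real FiD uses T5 and embeds
# "question: ... title: ... context: ..." text per passage.
batch, n_passages, seq_len, d_model = 2, 4, 32, 64

encoder = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
decoder = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)

# Random stand-ins for token embeddings of each (question, passage) pair.
x = torch.randn(batch, n_passages, seq_len, d_model)

# Encode passages independently: fold the passage axis into the batch axis,
# so each passage is a separate, parallelizable encoder call.
enc_in = x.view(batch * n_passages, seq_len, d_model)
enc_out = encoder(enc_in)                                   # (B*N, L, D)

# Fuse in the decoder: concatenate encoder outputs along the sequence axis.
fused = enc_out.view(batch, n_passages * seq_len, d_model)  # (B, N*L, D)

# The decoder cross-attends over all passages at every generation step.
tgt = torch.randn(batch, 8, d_model)   # 8 partially generated target tokens
out = decoder(tgt, memory=fused)
print(out.shape)  # torch.Size([2, 8, 64])
```

The `view` calls are the whole architecture: self-attention never crosses passage boundaries in the encoder, while cross-attention spans all of them in the decoder.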

Key ideas

  • FiD-large with 100 retrieved passages achieves 51.4% exact match on Natural Questions and 67.6% on TriviaQA open, compared to RAG's 44.5% and 56.1% respectively — gains of roughly 7 and 11 points.
  • Performance on Natural Questions scales monotonically with passage count: 37.3% at 1 passage, 48.8% at 10, 50.8% at 50, 51.4% at 100. The marginal return diminishes but never reverses.
  • TriviaQA improves by 6% and Natural Questions by 3.5% when scaling from 10 to 100 passages — evidence that the decoder is genuinely aggregating, not just picking the top passage.
  • The encoding step is cheap to parallelize: each (question, passage) pair is processed independently, so encoding can be spread across devices and wall-clock time need not grow linearly with passage count.
  • FiD-base, built on the 220M-parameter T5-base, surpasses the closed-book T5-11B (48.2% vs. 36.6% on NQ), demonstrating that retrieval makes smaller models punch far above their weight.

What holds up — and what doesn't

The core result is robust and has been replicated extensively. The architectural insight — independent encoding, joint decoding — is genuinely clean: it avoids the quadratic self-attention blowup that would result from naively concatenating all passages before the encoder, while still giving the decoder global context over all retrieved evidence.
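The asymptotic argument is easy to make concrete. A back-of-envelope comparison of attention-score entries per layer, with paper-scale numbers (100 passages of roughly 250 tokens each):

```python
# Self-attention cost comparison (illustrative): n passages of L tokens each.
# Concatenating everything before the encoder costs O((n*L)^2) per layer;
# FiD's independent per-passage encoding costs n * O(L^2).
n, L = 100, 250

naive_entries = (n * L) ** 2   # one giant attention matrix over n*L tokens
fid_entries = n * L ** 2       # n small attention matrices over L tokens

print(naive_entries // fid_entries)  # 100: the ratio is exactly n
```

The encoder saving is a factor of n, which is precisely what makes scaling to 100 passages feasible at all.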

The limitation the paper barely acknowledges is that the decoder's cross-attention is the bottleneck at inference time. Cross-attention must load all encoder key-value pairs per decoder layer per generation step, and those key-value tensors grow linearly with passage count. A 2023 follow-up, FiDO (arXiv:2212.08153), showed that replacing multi-head attention with multi-query attention and pruning most cross-attention layers yields roughly a 7x inference speedup with minimal accuracy loss — which implies the original FiD decoder carries substantial capacity the task does not actually need.
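The multi-query saving is simple to quantify. A rough count of the key/value elements a decoder layer must load per generation step, under assumed T5-large-like dimensions (these numbers are illustrative, not FiDO's exact configuration):

```python
# Cross-attention K/V memory per decoder layer (illustrative dimensions).
# Multi-head attention stores separate K and V per head over all encoder
# positions; multi-query attention (as in FiDO) shares a single K/V head
# across all query heads.
n_passages, seq_len = 100, 250
d_model, n_heads = 1024, 16
head_dim = d_model // n_heads
positions = n_passages * seq_len           # 25,000 encoder positions

mha_kv = 2 * positions * n_heads * head_dim  # K and V, one copy per head
mqa_kv = 2 * positions * 1 * head_dim        # K and V, one shared head

print(mha_kv // mqa_kv)  # 16: MQA loads n_heads times fewer K/V elements
```

Since decoding is memory-bandwidth-bound, shrinking the loaded K/V tensors by a factor of `n_heads` translates almost directly into wall-clock speedup.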

There is also a calibration gap the paper does not explore: it reports exact match, which rewards systems that happen to produce the precise canonical answer string. For factual synthesis tasks — summarizing findings across multiple passages rather than extracting a span — exact match understates errors and overstates confidence. In finance settings, where a wrong number in an otherwise correct sentence is a serious failure, exact match is the wrong metric entirely.

Why this matters for finance AI

Beancount ledger QA is a multi-passage retrieval problem by nature. A question like "What did I spend on travel in Q3 across all accounts?" requires synthesizing dozens of transaction entries from different dates, accounts, and commodity types. FiD's core finding — that generative models can aggregate across many retrieved passages, and that performance improves with more context — is directly encouraging.

The practical design implication is concrete: when building a Beancount QA layer, retrieving more candidate entries (50–100 rather than the usual top-5) and giving the generator joint access to all of them is likely better than relying on re-ranking to pick one right answer. The FiD architecture also maps cleanly to ledger structure: each transaction entry can be encoded independently (cheap, parallelizable) before the decoder synthesizes across all of them.
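A hypothetical sketch of that mapping, rendering each transaction as an independent FiD-style passage (the entry schema and field names here are illustrative stand-ins, not a real Beancount API; the "question: ... title: ... context: ..." layout mirrors the input format the paper uses):

```python
# Hypothetical: turn one ledger transaction into an independently encodable
# (question, passage) string, mirroring FiD's per-passage input format.
def entry_to_passage(question: str, entry: dict) -> str:
    postings = "; ".join(
        f"{p['account']} {p['amount']} {p['currency']}"
        for p in entry["postings"]
    )
    return (
        f"question: {question} "
        f"title: {entry['date']} {entry['payee']} "
        f"context: {postings}"
    )

# Illustrative data, not pulled from a real ledger.
entries = [
    {"date": "2024-07-02", "payee": "Airline", "postings": [
        {"account": "Expenses:Travel", "amount": "420.00", "currency": "USD"},
        {"account": "Liabilities:Card", "amount": "-420.00", "currency": "USD"},
    ]},
]
passages = [
    entry_to_passage("What did I spend on travel in Q3?", e) for e in entries
]
print(passages[0])
```

Each such string would be encoded separately, exactly like a retrieved Wikipedia passage, before the decoder fuses them.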

The inference cost concern is real for production deployments, but FiDO's follow-up shows it is solvable at the architecture level without accuracy penalty. The more pressing limitation for finance agents is that FiD is designed for factoid QA with short generative outputs. Ledger analysis often requires multi-step arithmetic — adding up amounts, computing ratios — and FiD's generator does not inherently route that to an interpreter. Combining FiD-style fusion with a PAL-style code-generation head is the natural next step for numeric accuracy.
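What that combination might look like, in a heavily simplified sketch: the generator emits a small program over the retrieved amounts instead of emitting the sum directly, and an interpreter produces the number. The generator call is mocked as a fixed string and the amounts are made up; this is a design sketch, not an implementation.

```python
# PAL-style routing of arithmetic to an interpreter (hedged sketch).
# A real system would have the FiD-style generator emit this program;
# here the "generated" program and the retrieved amounts are hard-coded.
retrieved_amounts = [420.00, 133.50, 89.99]  # travel postings (illustrative)

generated_program = "total = sum(amounts)"   # stand-in for model output

scope = {"amounts": retrieved_amounts}
exec(generated_program, scope)               # computed, not "guessed" by the LM
print(round(scope["total"], 2))  # 643.49
```

The point is that the language model only has to get the program right, while the arithmetic itself is deterministic — exactly the failure mode that matters for ledger numbers.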

  • FiDO (arXiv:2212.08153, ACL Findings 2023) — multi-query attention and cross-attention pruning recover FiD accuracy at 7x faster inference; essential before deploying FiD in production
  • REALM: Retrieval-Augmented Language Model Pre-Training (arXiv:2002.08909, ICML 2020) — Guu et al. show how to incorporate retrieval during pre-training rather than only at inference; provides the upstream motivation that FiD builds on
  • Atlas: Few-shot Learning with Retrieval Augmented Language Models (arXiv:2208.03299, JMLR 2023) — Izacard et al.'s own extension of FiD to few-shot settings with joint retriever and reader training, the most complete synthesis of the line of work