
Fusion-in-Decoder: How Multi-Passage Retrieval Improves Generative QA

· 5 min read
Mike Thrift
Marketing Manager

Retrieval-augmented generation lives or dies by how well the generator can synthesize evidence spread across multiple documents. Izacard and Grave's 2021 EACL paper, "Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering," proposes a deceptively simple architectural fix — encode passages independently, fuse them all in the decoder — that outperforms the then-dominant RAG framework by a significant margin. I'm reading it now because the design principle maps directly to ledger QA: before deciding how to retrieve entries in Beancount agents, it's worth understanding which fusion strategy actually works.

The paper


Lewis et al.'s original RAG (arXiv:2005.11401) marries a dense retriever with a BART generator but forces the generator to condition on a single retrieved passage at a time, marginalizing over passages either per-sequence (RAG-Sequence) or per-token (RAG-Token). Izacard and Grave identified this as the binding constraint: a model that can only see one passage at a time cannot easily triangulate across evidence that is scattered across documents.

Their FiD (Fusion-in-Decoder) solution is elegant. Each retrieved passage is concatenated with the question, then encoded independently by T5's encoder. The encoder runs once per passage — fully parallelizable. The decoder then performs cross-attention over the concatenation of all passage representations simultaneously. The encoder complexity scales linearly with the number of passages; the decoder, crucially, can attend across passage boundaries during every generation step. The paper uses T5-base and T5-large as the generator backbone.
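The encode-independently, fuse-in-the-decoder trick is mostly a matter of tensor reshaping. Here is a minimal shape-level sketch with generic PyTorch transformer layers standing in for T5's encoder and decoder (the sizes are illustrative, not the paper's):

```python
import torch
import torch.nn as nn

# Illustrative sizes: 2 questions, 4 retrieved passages each, 32 tokens per
# (question + passage) pair, hidden width 64. Real FiD uses T5 and embeds
# "question: ... title: ... context: ..." text per passage.
batch, n_passages, seq_len, d_model = 2, 4, 32, 64

encoder = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
decoder = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)

# Random stand-ins for token embeddings of each (question, passage) pair.
x = torch.randn(batch, n_passages, seq_len, d_model)

# Encode passages independently: fold the passage axis into the batch axis,
# so each passage is a separate, parallelizable encoder call.
enc_in = x.view(batch * n_passages, seq_len, d_model)
enc_out = encoder(enc_in)                                   # (B*N, L, D)

# Fuse in the decoder: concatenate encoder outputs along the sequence axis.
fused = enc_out.view(batch, n_passages * seq_len, d_model)  # (B, N*L, D)

# The decoder cross-attends over all passages at every generation step.
tgt = torch.randn(batch, 8, d_model)   # 8 partially generated target tokens
out = decoder(tgt, memory=fused)
print(out.shape)  # torch.Size([2, 8, 64])
```

The `view` calls are the whole architecture: self-attention never crosses passage boundaries in the encoder, while cross-attention spans all of them in the decoder.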

Key ideas

  • FiD-large with 100 retrieved passages achieves 51.4% exact match on Natural Questions and 67.6% on TriviaQA open, compared to RAG's 44.5% and 56.1% respectively — gains of roughly 7 and 11 points.
  • Performance on Natural Questions scales monotonically with passage count: 37.3% at 1 passage, 48.8% at 10, 50.8% at 50, 51.4% at 100. The marginal return diminishes but never reverses.
  • TriviaQA improves by 6% and Natural Questions by 3.5% when scaling from 10 to 100 passages — evidence that the decoder is genuinely aggregating, not just picking the top passage.
  • The encoding step is cheap to parallelize: each (question, passage) pair is processed independently, so encoding can be spread across devices and wall-clock time need not grow linearly with passage count.
  • FiD-base, built on the 220M-parameter T5-base, surpasses the closed-book T5-11B (48.2% vs. 36.6% on NQ), demonstrating that retrieval makes smaller models punch far above their weight.

What holds up — and what doesn't

The core result is robust and has been replicated extensively. The architectural insight — independent encoding, joint decoding — is genuinely clean: it avoids the quadratic self-attention blowup that would result from naively concatenating all passages before the encoder, while still giving the decoder global context over all retrieved evidence.
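The asymptotic argument is easy to make concrete. A back-of-envelope comparison of attention-score entries per layer, with paper-scale numbers (100 passages of roughly 250 tokens each):

```python
# Self-attention cost comparison (illustrative): n passages of L tokens each.
# Concatenating everything before the encoder costs O((n*L)^2) per layer;
# FiD's independent per-passage encoding costs n * O(L^2).
n, L = 100, 250

naive_entries = (n * L) ** 2   # one giant attention matrix over n*L tokens
fid_entries = n * L ** 2       # n small attention matrices over L tokens

print(naive_entries // fid_entries)  # 100: the ratio is exactly n
```

The encoder saving is a factor of n, which is precisely what makes scaling to 100 passages feasible at all.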

The limitation the paper barely acknowledges is that the decoder's cross-attention is the bottleneck at inference time. Cross-attention must load all encoder key-value pairs per decoder layer per generation step, and those key-value tensors grow linearly with passage count. A 2023 follow-up, FiDO (arXiv:2212.08153), showed that replacing multi-head attention with multi-query attention and pruning most cross-attention layers yields roughly a 7x inference speedup with minimal accuracy loss — which implies the original FiD decoder carries substantial capacity the task does not actually need.
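The multi-query saving is simple to quantify. A rough count of the key/value elements a decoder layer must load per generation step, under assumed T5-large-like dimensions (these numbers are illustrative, not FiDO's exact configuration):

```python
# Cross-attention K/V memory per decoder layer (illustrative dimensions).
# Multi-head attention stores separate K and V per head over all encoder
# positions; multi-query attention (as in FiDO) shares a single K/V head
# across all query heads.
n_passages, seq_len = 100, 250
d_model, n_heads = 1024, 16
head_dim = d_model // n_heads
positions = n_passages * seq_len           # 25,000 encoder positions

mha_kv = 2 * positions * n_heads * head_dim  # K and V, one copy per head
mqa_kv = 2 * positions * 1 * head_dim        # K and V, one shared head

print(mha_kv // mqa_kv)  # 16: MQA loads n_heads times fewer K/V elements
```

Since decoding is memory-bandwidth-bound, shrinking the loaded K/V tensors by a factor of `n_heads` translates almost directly into wall-clock speedup.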

There is also a calibration gap the paper does not explore: it reports exact match, which rewards systems that happen to produce the precise canonical answer string. For factual synthesis tasks — summarizing findings across multiple passages rather than extracting a span — exact match understates errors and overstates confidence. In finance settings, where a wrong number in an otherwise correct sentence is a serious failure, exact match is the wrong metric entirely.

Why this matters for finance AI

Beancount ledger QA is a multi-passage retrieval problem by nature. A question like "What did I spend on travel in Q3 across all accounts?" requires synthesizing dozens of transaction entries from different dates, accounts, and commodity types. FiD's core finding — that generative models can aggregate across many retrieved passages, and that performance improves with more context — is directly encouraging.

The practical design implication is concrete: when building a Beancount QA layer, retrieving more candidate entries (50–100 rather than the usual top-5) and giving the generator joint access to all of them is likely better than relying on re-ranking to pick one right answer. The FiD architecture also maps cleanly to ledger structure: each transaction entry can be encoded independently (cheap, parallelizable) before the decoder synthesizes across all of them.
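A hypothetical sketch of that mapping, rendering each transaction as an independent FiD-style passage (the entry schema and field names here are illustrative stand-ins, not a real Beancount API; the "question: ... title: ... context: ..." layout mirrors the input format the paper uses):

```python
# Hypothetical: turn one ledger transaction into an independently encodable
# (question, passage) string, mirroring FiD's per-passage input format.
def entry_to_passage(question: str, entry: dict) -> str:
    postings = "; ".join(
        f"{p['account']} {p['amount']} {p['currency']}"
        for p in entry["postings"]
    )
    return (
        f"question: {question} "
        f"title: {entry['date']} {entry['payee']} "
        f"context: {postings}"
    )

# Illustrative data, not pulled from a real ledger.
entries = [
    {"date": "2024-07-02", "payee": "Airline", "postings": [
        {"account": "Expenses:Travel", "amount": "420.00", "currency": "USD"},
        {"account": "Liabilities:Card", "amount": "-420.00", "currency": "USD"},
    ]},
]
passages = [
    entry_to_passage("What did I spend on travel in Q3?", e) for e in entries
]
print(passages[0])
```

Each such string would be encoded separately, exactly like a retrieved Wikipedia passage, before the decoder fuses them.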

The inference cost concern is real for production deployments, but FiDO's follow-up shows it is solvable at the architecture level without accuracy penalty. The more pressing limitation for finance agents is that FiD is designed for factoid QA with short generative outputs. Ledger analysis often requires multi-step arithmetic — adding up amounts, computing ratios — and FiD's generator does not inherently route that to an interpreter. Combining FiD-style fusion with a PAL-style code-generation head is the natural next step for numeric accuracy.
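What that combination might look like, in a heavily simplified sketch: the generator emits a small program over the retrieved amounts instead of emitting the sum directly, and an interpreter produces the number. The generator call is mocked as a fixed string and the amounts are made up; this is a design sketch, not an implementation.

```python
# PAL-style routing of arithmetic to an interpreter (hedged sketch).
# A real system would have the FiD-style generator emit this program;
# here the "generated" program and the retrieved amounts are hard-coded.
retrieved_amounts = [420.00, 133.50, 89.99]  # travel postings (illustrative)

generated_program = "total = sum(amounts)"   # stand-in for model output

scope = {"amounts": retrieved_amounts}
exec(generated_program, scope)               # computed, not "guessed" by the LM
print(round(scope["total"], 2))  # 643.49
```

The point is that the language model only has to get the program right, while the arithmetic itself is deterministic — exactly the failure mode that matters for ledger numbers.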

  • FiDO (arXiv:2212.08153, ACL Findings 2023) — multi-query attention and cross-attention pruning recover FiD accuracy at 7x faster inference; essential before deploying FiD in production
  • REALM: Retrieval-Augmented Language Model Pre-Training (arXiv:2002.08909, ICML 2020) — Guu et al. show how to incorporate retrieval during pre-training rather than only at inference; provides the upstream motivation that FiD builds on
  • Atlas: Few-shot Learning with Retrieval Augmented Language Models (arXiv:2208.03299, JMLR 2023) — Izacard et al.'s own extension of FiD to few-shot settings with joint retriever and reader training, the most complete synthesis of the line of work