Found in the Middle: Calibrating Positional Attention Bias Improves Long-Context RAG
I've been thinking about the lost-in-the-middle problem ever since writing the log on Liu et al.'s original finding: pass a long context to an LLM, and it will reliably ignore evidence buried in the middle. "Found in the Middle: Calibrating Positional Attention Bias Improves Long Context Utilization" (Hsieh et al., ACL Findings 2024, arXiv:2406.16008) offers the most direct and practical fix I've seen: a training-free inference-time calibration that subtracts out the model's positional bias from its attention weights, recovering up to 15 percentage points of RAG accuracy.
The paper
Hsieh et al. start from a diagnostic observation: LLMs — even those trained on long contexts — exhibit a persistent U-shaped attention pattern. Tokens at the beginning and at the end of the input receive disproportionately high attention regardless of whether they are relevant, while tokens in the middle are systematically underweighted. The authors connect this empirically to the lost-in-the-middle accuracy dip rather than treating it as a separate phenomenon.
Their fix is elegant in concept. They decompose attention into two additive components: relevance (what we want) and positional bias (what we don't). To isolate the bias term, they place a "dummy" document — uninformative filler content — at each position in the same context and record the attention it receives. That dummy-document attention approximates the pure positional prior, so subtracting it from the real attention scores leaves a residual that better reflects true relevance:
Calibrated attention = Attn(document, k) − Attn(dummy, k)
The calibrated scores are then used to re-rank or re-weight retrieved documents before the final answer-generation step. Critically, no training is required: the calibration is applied at inference time to the last 16 decoder layers and all attention heads. The cost is O(K) additional forward passes, where K is the number of retrieved documents — non-trivial but predictable.
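A minimal sketch of the subtract-and-rerank step, under the simplifying assumption that we already have per-document attention mass (averaged over the relevant layers and heads) from two forward passes — one with the real documents and one with a dummy document at each position. The dictionaries and values here are toy stand-ins, not numbers from the paper:

```python
# Sketch of found-in-the-middle calibration. `attn_real[k]` is the
# model's attention mass on the document at position k; `attn_dummy[k]`
# is the attention a dummy document receives at that same position.
# In a real pipeline both would come from model forward passes.

def calibrate(attn_real, attn_dummy):
    """Subtract the positional prior (dummy-document attention) from
    observed attention to estimate content relevance per position."""
    return {k: attn_real[k] - attn_dummy[k] for k in attn_real}

def rerank(positions, attn_real, attn_dummy):
    """Re-order retrieved documents by calibrated attention, highest first."""
    scores = calibrate(attn_real, attn_dummy)
    return sorted(positions, key=lambda k: scores[k], reverse=True)

# Toy example: raw attention shows the U-shape (positions 0 and 4 high),
# but after removing the positional prior, position 2 ranks first.
attn_real = {0: 0.30, 1: 0.10, 2: 0.25, 3: 0.08, 4: 0.27}
attn_dummy = {0: 0.28, 1: 0.12, 2: 0.10, 3: 0.12, 4: 0.26}
print(rerank([0, 1, 2, 3, 4], attn_real, attn_dummy))  # → [2, 0, 4, 1, 3]
```

The key design point is that the dummy pass is query-independent in position: it measures what the model attends to purely because of *where* a document sits, which is exactly the signal we want to cancel.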
Key ideas
- The U-shaped attention bias is intrinsic to the model architecture and persists even in models explicitly trained with long-context objectives.
- Passing a dummy (empty/noise) document through the same retrieval context isolates the positional prior; subtracting it removes bias without any finetuning.
- Recall@3 on NaturalQuestions (K=20, gold document placed in the middle) jumps from 20.52% to 68.32% with calibration; at K=10, from 36.38% to 74.27%.
- End-to-end QA accuracy improves by 6–15 percentage points when the gold document is mid-context; improvements hold in 22 of 24 experimental configurations.
- The method outperforms six comparison baselines: vanilla attention, query-generation ranking, relevance-generation prompting, attention sorting (Peysakhovich & Lerer 2023), prompt reordering, and LongLLMLingua-rk.
- The method was evaluated on NaturalQuestions (2,655 real queries over Wikipedia) and SynthWiki (990 synthetic GPT-4-generated entries).
What holds up — and what doesn't
The core result is striking and I believe it. A 20.52%→68.32% Recall@3 jump for mid-context gold documents is not the kind of number that evaporates under scrutiny — it's measuring something real about how attention is distributed. The training-free design is a genuine practical advantage: you can drop this on top of any existing RAG pipeline without touching the model weights.
That said, I have some reservations. First, the "dummy document" approach assumes that positional bias is roughly position-separable and additive — a linear decomposition that the authors themselves flag as potentially oversimplifying. Real attention bias may interact with content in non-linear ways. Second, the O(K) extra forward passes are priced as "acceptable" but never benchmarked for latency or cost. In a production system with K=20 retrievals, you're running 21 forward passes instead of 1 per query. For a Beancount agent triaging hundreds of transactions, this multiplier matters.
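The cost multiplier is easy to make concrete. Assuming one dummy-document pass per retrieved position plus the original generation pass (the per-pass latency below is an illustrative assumption, not a measurement from the paper):

```python
# Back-of-envelope cost of calibration: K dummy-document forward
# passes plus the one real pass per query.

def passes_per_query(k_retrieved: int) -> int:
    """Total forward passes: one dummy pass per position + the real one."""
    return k_retrieved + 1

def est_latency_s(k_retrieved: int, secs_per_pass: float) -> float:
    """Rough per-query latency if passes run sequentially."""
    return passes_per_query(k_retrieved) * secs_per_pass

print(passes_per_query(20))    # → 21
print(est_latency_s(20, 0.5))  # → 10.5 (at an assumed 0.5 s per pass)
```

In practice the dummy passes are independent of one another, so they batch well; the compute cost stays O(K) but wall-clock latency need not grow linearly.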
Third — and this is the most interesting limitation — the authors note that positional bias might actually be useful for certain tasks. Recency bias, for instance, might be what makes a model weight recent ledger entries correctly over older ones. Removing bias indiscriminately could hurt tasks where position is a valid signal. This is acknowledged but not studied.
Finally, the experiments use NaturalQuestions and a synthetic dataset. Finance-specific documents — dense tables, multi-year filings, ledger entries with repetitive structure — are very different from open-domain Wikipedia passages. The calibration would need to be validated on those distributions before claiming it will work for financial RAG.
Why this matters for finance AI
The direct connection is clear: every log since DocFinQA has been circling the same problem. When a Beancount agent retrieves 20 relevant ledger entries to answer a question like "reconcile March against the bank statement," entries from the middle of the retrieved window will be systematically underattended relative to entries at the top and bottom of the context. That's not a retrieval failure — it's a generation-side failure that no amount of retrieval-ranking improvement will fix.
The found-in-the-middle calibration is a plausible mitigation that requires no retraining of the underlying model and could be applied directly inside the generation step of any ledger QA pipeline. The O(K) cost concern is real but manageable — a 20-document retrieval window with a moderately sized model is still well within practical bounds. What I'd want to see before deploying it is a validation on Beancount-structured data specifically: does the positional correction help uniformly, or does it inadvertently suppress the recency signal that makes recent transactions more trustworthy than old ones?
The broader principle — that attention mechanisms encode positional priors independently of content relevance, and that those priors can be calibrated away without retraining — is one worth keeping. It opens the door to similar calibrations for other biases: token-frequency bias, input-length normalization, verbosity bias in generation.
What to read next
- "Mitigate Position Bias in LLMs via Scaling a Single Hidden States Channel" (arXiv:2406.02536, ACL Findings 2025) — proposes scaling a single hidden-state dimension rather than subtracting attention scores; worth comparing to found-in-the-middle's approach directly.
- "Large Language Models for Anomaly and Out-of-Distribution Detection: A Survey" (arXiv:2409.01980, NAACL 2025) — next on the reading list; ties together the AnoLLM, CausalTAD, and AD-LLM thread into a unified taxonomy.
- Liu et al., "Lost in the Middle: How Language Models Use Long Contexts" (arXiv:2307.03172, TACL 2023) — the original diagnosis that found-in-the-middle is responding to; essential background reading.
