Lost in the Middle: Position Bias in LLMs and Its Impact on Finance AI
When I look back at the DocFinQA entry — where retrieval-based pipelines and long-context LLMs both collapsed on SEC filings with 123K-token contexts — the question I left hanging was why. This paper by Liu et al. (TACL 2024, arXiv:2307.03172) is the mechanistic answer, and it turns out the failure mode is simpler and more stubborn than I would have expected.
The paper
"Lost in the Middle: How Language Models Use Long Contexts" by Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang runs two targeted experiments: multi-document question answering over NaturalQuestions-Open (with 10, 20, and 30 retrieved documents) and synthetic key-value retrieval (with 75, 140, and 300 pairs). In each experiment they systematically vary the position of the relevant document or key-value pair within the input context, from the first slot to the last, while holding everything else fixed. The finding is clean: performance traces a U-shaped curve with the trough at the middle of the context, and the curve appears across every model tested.
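The position-permutation protocol is simple enough to sketch. This is my own reconstruction, not the authors' code: hold the document set fixed and slide the single relevant document through every slot, so position is the only thing that changes between prompts.

```python
# Sketch of the position-permutation protocol (my reconstruction, not the
# authors' code): the same question, gold document, and distractors produce
# one prompt per possible gold position.

def build_prompts(question, gold_doc, distractors):
    """Yield (gold_position, prompt) pairs, one per slot in the context."""
    n = len(distractors) + 1
    for pos in range(n):
        docs = distractors[:pos] + [gold_doc] + distractors[pos:]
        numbered = "\n\n".join(
            f"Document [{i + 1}] {d}" for i, d in enumerate(docs)
        )
        yield pos, (
            "Write a high-quality answer for the given question "
            "using only the provided search results.\n\n"
            f"{numbered}\n\nQuestion: {question}\nAnswer:"
        )
```

Plotting accuracy per gold position over these prompts is what traces the U-shaped curve.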
Key ideas
- The U-shape is real and consistent. In the 20-document QA setting, performance at the first position was roughly 75% and degraded to around 55% at position 10 before recovering to about 72% at position 20 — a ~20-point gap between the edges and the center.
- All the models follow the same pattern. The tested models span closed and open, small and large: GPT-3.5-Turbo (4K and 16K), GPT-4, Claude-1.3 (8K and 100K), MPT-30B-Instruct, and LongChat-13B. The U-curve showed up in every one of them, including models explicitly marketed for extended context windows.
- Even Claude-1.3-100K isn't immune. The 100K-context variant behaved like the others. A long context window does not mean the model actually attends uniformly across it.
- The closed-book baseline sets a sobering floor. GPT-3.5-Turbo without any documents answered 56.1% of NaturalQuestions correctly; with oracle access to just the one relevant document it hit 88.3%. But at the worst middle positions in the 20-document setting, performance dropped below the closed-book baseline — meaning adding more context was actively harmful.
- Encoder-decoder models (Flan-T5-XXL, Flan-UL2) are more robust to position within the sequence lengths they were trained on, but the U-shape reappears once contexts exceed that length. The architectural difference matters, yet both still degrade at scale.
- The likely culprit is causal attention masking. In a decoder-only model each token attends only to preceding tokens, so tokens at the very beginning are visible to every later query position and can accumulate attention from the entire rest of the sequence, while recency pulls attention toward the end of the context. Tokens in the middle get neither advantage.
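The asymmetry from the causal mask can be made concrete with a toy count. This is my own illustration, not from the paper: under a causal mask, the key at position i is visible to every query at position j >= i, so early keys can receive attention from far more of the sequence than middle keys (recency effects at the end are a separate mechanism, not captured here).

```python
# Toy illustration (mine, not the paper's): under a causal mask, how many
# query positions are allowed to attend to each key position?

def visibility_counts(seq_len):
    """Key at position i is visible to queries at positions i..seq_len-1."""
    return [seq_len - i for i in range(seq_len)]

print(visibility_counts(8))  # [8, 7, 6, 5, 4, 3, 2, 1]
```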
What holds up — and what doesn't
The experimental design here is admirably clean: position is the only variable being manipulated, the tasks are standard benchmarks, and the finding replicates across a wide range of model families. I have no quarrel with the core result.
What I find less convincing is the framing of the key-value retrieval task as a meaningful proxy for real use. UUID-to-UUID lookups test whether a model can copy a string back out of its context, not whether it can do anything requiring reasoning. The U-curve shows up there too, which strengthens the position-bias claim, but it also means the paper is conflating two different phenomena: retrieval accuracy on exact-match tasks and reasoning quality over relevant passages. I would want to know whether the U-shape gets worse or better when the relevant document requires multi-step inference before the final answer, not just verbatim regurgitation.
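To make the "exact-match lookup" point concrete, here is a sketch of the synthetic task as I understand it (my reconstruction, not the authors' generator): the prompt is a JSON object of random UUID pairs, and the model must return the value for one queried key.

```python
# Sketch of the key-value retrieval task (my reconstruction): pure
# copy-from-context lookup, with no reasoning step between query and answer.
import json
import uuid

def make_kv_prompt(n_pairs, query_index):
    """Return (prompt, expected_value) for a lookup at query_index."""
    pairs = {str(uuid.uuid4()): str(uuid.uuid4()) for _ in range(n_pairs)}
    query_key = list(pairs)[query_index]
    prompt = (
        "Extract the value corresponding to the specified key from the "
        "JSON object below.\n\n"
        f"{json.dumps(pairs)}\n\n"
        f'Key: "{query_key}"\nCorresponding value:'
    )
    return prompt, pairs[query_key]
```

Varying `query_index` while holding `n_pairs` fixed is the key-value analogue of sliding the gold document through the multi-document context.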
There is also a gap the authors mostly acknowledge but don't close: they never test whether instruction fine-tuning or RLHF changes the position sensitivity, only whether a larger context window does. If the cause really is architectural (causal masking), instruction tuning probably won't fix it, but the paper doesn't confirm this.
Why this matters for finance AI
This paper provides the mechanistic explanation for an empirical pattern I keep running into. DocFinQA collapsed on long SEC filings. IRCoT and FLARE both retrieve multiple passages and concatenate them before reasoning. Every RAG pipeline I've looked at in a finance context dumps retrieved passages sequentially into the prompt and hopes the model will attend to the right one.
The implication for Beancount agents is concrete. If an agent retrieves ten ledger entries as context, the entries in positions 3–7 are at highest risk of being ignored or hallucinated around. This is not a retrieval problem — it is a presentation problem. Two responses follow from this paper: either put the most diagnostically relevant entries first (and last), or don't concatenate at all and reason over one passage at a time.
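The first of those responses is a one-function fix. The sketch below is my own suggestion for a hypothetical Beancount agent, not something from the paper: given passages already ranked by relevance, interleave them so the strongest evidence lands at the edges of the context and the weakest sinks into the middle trough.

```python
# Edge-first reordering (my own sketch, not from the paper): place the
# top-ranked passages at the start and end of the context, where the
# U-curve says the model actually attends.

def edge_first_order(passages_by_relevance):
    """Input: passages sorted most-relevant first.
    Output: ordering with the best items at the edges, worst in the middle."""
    front, back = [], []
    for i, passage in enumerate(passages_by_relevance):
        (front if i % 2 == 0 else back).append(passage)
    return front + back[::-1]

print(edge_first_order([1, 2, 3, 4, 5]))  # [1, 3, 5, 4, 2]
```

With this ordering the most relevant passage opens the context and the second most relevant closes it, pushing the weakest retrievals into the positions the model is most likely to ignore anyway.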
The finding also complicates the long-context-LLM narrative. Every quarter a new model announces a larger context window, but this paper says a long window buys little if you distribute evidence uniformly across it. A 128K-context model that buries the relevant transaction around token 60K can do worse than a 4K-context model that retrieves precisely the right passage.
For write-back safety, the implications are uncomfortable: if the model is asked to summarize a ledger session and the relevant "do not post this transaction" policy rule appears in the middle of a long system prompt, the model may act as though it never read that rule.
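A cheap defensive pattern follows from the U-curve. This is my own suggestion, not something the paper proposes: repeat hard constraints at both edges of the system prompt, so a "do not post" rule never sits only in the middle of a long context.

```python
# Defensive prompt assembly (my own sketch): hard constraints appear at
# both the start and the end of the system prompt, the two positions the
# U-curve says the model reliably attends to.

def build_system_prompt(critical_rules, background_sections):
    rules = "\n".join(f"- {rule}" for rule in critical_rules)
    body = "\n\n".join(background_sections)
    return (
        "HARD CONSTRAINTS (read first):\n" + rules + "\n\n"
        + body + "\n\n"
        + "REMINDER, HARD CONSTRAINTS (read again before acting):\n" + rules
    )
```

Duplication costs tokens, but for a write-back agent the cost of a skipped policy rule is far higher than a few repeated lines.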
What to read next
- "Found in the Middle: How Language Models Use Long Contexts Better via Plug-and-Play Positional Encoding" (Zhang et al., arXiv:2403.04797) — proposes Multi-scale Positional Encoding (Ms-PoE) as a training-free fix via RoPE scaling; claims up to 3.8-point improvement on Zero-SCROLLS, directly addressing the U-curve.
- "Never Lost in the Middle: Mastering Long-Context Question Answering with Position-Agnostic Decompositional Training" (arXiv:2311.09198) — takes the opposite approach and trains the model to be explicitly position-agnostic; the comparison with Ms-PoE clarifies whether fine-tuning or inference-time tricks are the better lever.
- "Mitigate Position Bias in Large Language Models via Scaling a Single Dimension" (arXiv:2406.02536) — identifies the specific positional hidden states dimension responsible for the bias and scales it without retraining; the most surgical fix proposed so far, relevant to deploying existing models without retraining.
