IRCoT: Interleaving Retrieval with Chain-of-Thought for Multi-Step QA

· 6 min read
Mike Thrift
Marketing Manager

I've been reading about RAG variants for the last few entries and wanted to understand IRCoT — the paper by Trivedi, Balasubramanian, Khot, and Sabharwal (ACL 2023) that interleaves retrieval with chain-of-thought reasoning rather than doing a single retrieval pass upfront. FLARE approached the same problem by predicting when to retrieve; IRCoT takes a simpler mechanical approach and asks a more pointed question: what if each sentence of a reasoning chain is itself a retrieval query?

The paper

Existing retrieve-then-read pipelines retrieve documents once based on the original question, then hand everything to an LLM. For single-hop questions that's often enough. For multi-step questions — "Who was the composer of the film whose director was born in the same city as Bach?" — the relevant documents for step two are only identifiable after you've partially answered step one. The authors call this the knowledge dependency problem and argue one-step retrieval is structurally incapable of solving it.

IRCoT addresses this with an alternating loop: generate the next sentence of a reasoning chain, use that sentence as a BM25 query to retrieve additional paragraphs, add the retrieved paragraphs to the prompt context, generate the next reasoning sentence, and repeat. The loop runs for up to eight steps, capping total context at fifteen paragraphs. No training is required — the method is entirely prompting-based and evaluated zero-shot on GPT-3 (code-davinci-002) and in few-shot settings on Flan-T5.
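The alternating loop above can be sketched in a few lines. This is a minimal illustration with toy stubs — `retrieve` is a crude token-overlap ranker standing in for BM25, and `generate_sentence` is a scripted stand-in for the LLM — not the authors' implementation; only the loop shape, the step cap, and the paragraph budget come from the paper.

```python
MAX_STEPS = 8        # paper's cap on reasoning steps
MAX_PARAGRAPHS = 15  # paper's cap on total retrieved context

# Toy corpus keyed by paragraph id (illustrative only).
CORPUS = {
    "p1": "Bach was born in Eisenach.",
    "p2": "The film's director was born in Eisenach.",
    "p3": "The film's composer was John Doe.",
}

def retrieve(query, k=2):
    """Toy lexical retriever: rank paragraphs by token overlap with the query."""
    q = set(query.lower().split())
    scored = sorted(CORPUS.items(),
                    key=lambda kv: -len(q & set(kv[1].lower().split())))
    return [pid for pid, _ in scored[:k]]

def generate_sentence(question, paragraphs, chain):
    """Toy CoT generator: emits a fixed two-step chain for the demo."""
    script = ["The director was born in Eisenach.",
              "So the answer is John Doe."]
    return script[min(len(chain), len(script) - 1)]

def ircot(question):
    paragraphs, chain = [], []
    query = question                      # first retrieval uses the question itself
    for _ in range(MAX_STEPS):
        for pid in retrieve(query):
            if pid not in paragraphs and len(paragraphs) < MAX_PARAGRAPHS:
                paragraphs.append(pid)    # grow the shared context
        sentence = generate_sentence(question, paragraphs, chain)
        chain.append(sentence)
        if "answer is" in sentence:       # stop once the CoT declares an answer
            break
        query = sentence                  # next retrieval query = latest CoT sentence
    return chain, paragraphs

chain, paragraphs = ircot(
    "Who composed the film whose director was born in Bach's birth city?")
```

The key line is the last one in the loop body: the freshly generated reasoning sentence becomes the next retrieval query, which is the entirety of IRCoT's mechanism.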

Key ideas

  • On HotpotQA, IRCoT improves retrieval recall by +11.3 points over one-step retrieval with GPT-3, and downstream QA F1 by +7.1 points (60.7 vs 53.6).
  • The gains are larger on harder datasets: +22.6 recall points and +13.2 F1 points on 2WikiMultihopQA with GPT-3.
  • Flan-T5-XXL (11B) with IRCoT achieves +15.3 F1 on 2WikiMultihopQA over one-step retrieval, which is the largest per-dataset gain in the paper.
  • Flan-T5-XL (3B) with IRCoT outperforms GPT-3 (175B) with one-step retrieval — a 58× parameter gap overcome by retrieval strategy alone.
  • IRCoT reduces factual errors in generated CoT by 50% on HotpotQA and 40% on 2WikiMultihopQA relative to one-step retrieval (manual annotation of 40 questions per dataset).
  • The method generalises out-of-distribution: using demonstrations from one dataset to evaluate another shows similar gains, confirming the approach isn't just fitting in-distribution patterns.

What holds up — and what doesn't

The core claim — that multi-step reasoning needs multi-step retrieval — is convincing and the experiments are clean. Evaluating on four genuinely difficult multi-hop benchmarks with different knowledge structures (bridge, comparison, discrete reasoning) makes the case broadly rather than on a single dataset. The ablation showing that a dedicated reader (rather than extracting the answer directly from the generated CoT) consistently helps is a useful practical finding.

What I find less satisfying: the retrieval budget is fixed at fifteen paragraphs regardless of question difficulty, and the stopping criterion is a hard step limit rather than a model-assessed "I have enough information" signal. FLARE's uncertainty-based triggering is more principled in that respect, though it requires calibrated token probabilities. IRCoT's BM25 backbone is deliberately simple — dense retrieval would almost certainly improve results further, but the authors don't test it; they argue simplicity makes the reasoning chain's contribution clearer, which is fair.

The computational cost is real: each generated sentence triggers a retrieval call, so latency scales linearly with reasoning depth. Recent work in 2025 (LevelRAG, GlobalRAG) reports that this rigid one-sentence-one-retrieval pipeline constrains performance on tasks requiring parallel information gathering rather than sequential chain reasoning, with GlobalRAG reporting a 6.54 F1 point improvement over IRCoT on its benchmark.
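To make the contrast concrete, here is a hypothetical stopping rule that combines IRCoT's hard step cap with a FLARE-style confidence check. The function name, threshold, and signature are my own illustration — neither paper's code looks like this:

```python
def should_stop(step, token_probs, max_steps=8, conf_threshold=0.8):
    """Decide whether to stop the retrieve-reason loop.

    step        -- number of reasoning steps taken so far
    token_probs -- per-token probabilities of the latest generated sentence
    """
    if step >= max_steps:
        # IRCoT-style: a fixed budget, applied regardless of question difficulty.
        return True
    # FLARE-style: if every token in the latest sentence is high-confidence,
    # assume the model has enough evidence and skip further retrieval.
    return min(token_probs) >= conf_threshold
```

The second branch is what makes the criterion adaptive — easy questions terminate early, hard ones spend the full budget — but it only works if the model's token probabilities are reasonably calibrated.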

The hallucination analysis is also thinner than I'd like: 40 questions per dataset is too small for strong claims, and "factual error" is hand-annotated without inter-annotator agreement reported.

Why this matters for finance AI

The dependency problem IRCoT solves maps directly to how a Beancount agent traces multi-step financial questions. "What was the net effect of all transactions touching account X between dates Y and Z, after accounting for the currency conversions noted in the memo fields?" can't be answered with a single vector lookup — you need to find the matching transactions, then retrieve the referenced exchange rates, then potentially retrieve the contra accounts. Each retrieval step depends on what was found in the previous one.

The practical design lesson is the retrieve-reason loop: rather than stuffing an entire multi-year ledger into context or performing a single semantic search, an IRCoT-style agent would use each intermediate reasoning sentence — "the total debit to expenses:food in Q1 was $1,240" — as the query for the next retrieval step. That keeps the context window lean and the retrieved evidence purpose-specific. The finding that a 3B model with good retrieval beats a 175B model with poor retrieval is especially relevant given the cost constraints of running agents over personal or small-business ledgers. Getting retrieval right may matter more than model scale.
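The dependent-retrieval pattern described above can be sketched against a toy ledger. Everything here is illustrative — `search_ledger`, the entry format, and the queries are hypothetical, not a real Beancount API — the point is only that the second query is derived from what the first retrieval found:

```python
# Toy flattened ledger standing in for a real Beancount file.
LEDGER = [
    "2024-01-15 expenses:food $400 memo: converted from EUR @ 1.08",
    "2024-02-10 expenses:food $440",
    "2024-03-05 expenses:food $400 memo: converted from EUR @ 1.10",
    "rate table: EUR/USD Q1 average 1.09",
]

def search_ledger(query):
    """Toy keyword search standing in for a real ledger index."""
    terms = query.lower().split()
    return [entry for entry in LEDGER
            if any(t in entry.lower() for t in terms)]

# Step 1: retrieve transactions matching the original question.
txns = search_ledger("expenses:food")

# Step 2: an intermediate finding -- "some entries mention EUR conversions" --
# becomes the next query, pulling in the referenced rate data that a single
# upfront search over the question alone would have missed.
rates = search_ledger("eur")
```

A single vector lookup over the question would never surface the rate table, because nothing in the question mentions currencies; only the intermediate finding does.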

The limitation worth carrying forward: IRCoT's rigid one-retrieve-per-sentence structure will struggle with ledger queries that require aggregating across many parallel evidence streams simultaneously — e.g., computing a budget variance across twelve expense sub-accounts at once. That's where a planning-first approach (like LATS or a structured query decomposition) would complement IRCoT rather than compete with it.

  • IRCoT's own paper cites DecomP (Decomposed Prompting, Khot et al. 2022, arXiv:2210.06726) as a key baseline — worth reading to understand the alternative strategy of decomposing questions into subquestions before retrieval rather than interleaving.
  • LevelRAG (arXiv:2502.18139) builds on IRCoT-style iterative retrieval by adding a high-level planner that rewrites queries across multiple search engines; a more recent take on the same problem that addresses IRCoT's rigidity.
  • "Chain-of-Retrieval Augmented Generation" (CoRAG, arXiv:2501.14342) is a 2025 follow-up that frames multi-step retrieval as a chain, making the IRCoT loop explicit and adding training signal — a natural successor to read after this paper.