
Self-RAG: Adaptive Retrieval and Self-Critique for LLMs

· 6 min read
Mike Thrift
Marketing Manager

Standard RAG retrieves every time, whether or not retrieval helps. Self-RAG by Asai et al. (ICLR 2024 Oral) asks a different question: what if the model itself decided when to look something up, and then graded the result? That turns out to matter quite a bit, and the mechanism is clean enough to be worth studying carefully.

The paper


The core grievance with vanilla Retrieval-Augmented Generation is that it is indiscriminate: retrieve a fixed number of passages for every input, prepend them, and generate. That works well enough when retrieval helps, but it actively hurts when the passages are irrelevant or when the model already has the answer in its weights. The paper introduces Self-Reflective Retrieval-Augmented Generation (Self-RAG), authored by Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi (University of Washington and IBM Research).

The key mechanism is a set of four special reflection tokens baked into the model's vocabulary at training time. Retrieve decides whether to call the retriever at all. IsRel (relevance) assesses whether a retrieved passage actually contains useful information for the query. IsSup (support) checks whether the generated claim is fully, partially, or not supported by the passage. IsUse (utility) scores the overall response quality from 1 to 5. The model learns to emit these tokens inline with its normal output — so it critiques its own retrieval and generation in one forward pass.
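As a concrete sketch, the inline critique can be pictured as markup interleaved with the text. The bracketed token format and value strings below are assumptions made for a runnable illustration — the released checkpoints use dedicated vocabulary entries, not string markup:

```python
import re

# Illustrative bracketed format for the four reflection tokens. Real
# Self-RAG emits special vocabulary tokens; this regex-based markup is
# an assumption for the sake of a self-contained example.
REFLECTION = re.compile(r"\[(Retrieve|IsRel|IsSup|IsUse)=([^\]]+)\]")

def parse_reflection(generated: str) -> tuple[str, dict]:
    """Split a generated segment into plain text and its reflection labels."""
    labels = dict(REFLECTION.findall(generated))
    text = REFLECTION.sub("", generated).strip()
    return text, labels

segment = ("[Retrieve=Yes][IsRel=Relevant]"
           "Paris is the capital of France."
           "[IsSup=Fully][IsUse=5]")
text, labels = parse_reflection(segment)
# text   -> "Paris is the capital of France."
# labels -> {"Retrieve": "Yes", "IsRel": "Relevant", "IsSup": "Fully", "IsUse": "5"}
```

The point of the structure: a downstream caller can inspect the labels (was retrieval used? was the claim supported?) without re-running anything.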

Training is two-stage: first, a critic model (a fine-tuned LLaMA 2 7B) is trained on roughly 4,000–20,000 labeled examples per token type, reaching over 90% agreement with GPT-4 predictions. That critic then annotates a 150,000-example instruction-output corpus offline, and the generator is trained on the annotated data with reflection tokens treated as ordinary vocabulary. No reinforcement learning is required.
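The offline annotation stage reduces to a simple loop: the critic labels each example, and the generator is then fine-tuned on the augmented text with standard next-token loss. The critic interface and the bracketed markup here are assumptions; the actual pipeline operates on special tokens rather than strings:

```python
class StubCritic:
    """Stand-in for the fine-tuned 7B critic; a real critic predicts
    reflection-token values from the (instruction, output) pair."""
    def predict_tokens(self, instruction: str, output: str) -> dict:
        return {"Retrieve": "No", "IsUse": "5"}

def annotate(corpus, critic):
    """Prepend critic-predicted reflection tokens to each training output.
    The generator trains on this augmented corpus with ordinary
    next-token loss -- no reinforcement learning involved."""
    augmented = []
    for instruction, output in corpus:
        tokens = critic.predict_tokens(instruction, output)
        markup = "".join(f"[{k}={v}]" for k, v in tokens.items())
        augmented.append((instruction, markup + output))
    return augmented

data = annotate([("Define RAG.", "Retrieval-augmented generation ...")],
                StubCritic())
# data[0][1] starts with "[Retrieve=No][IsUse=5]"
```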

Key ideas

  • The four reflection tokens (Retrieve, IsRel, IsSup, IsUse) give the model a structured internal dialogue about whether evidence is worth trusting — not just a binary retrieve/don't decision.
  • Self-RAG 13B reaches 55.8% on PopQA, 69.3% on TriviaQA, 74.5% on PubHealth, 73.1% on ARC-Challenge, and a Biography FactScore of 80.2 — outperforming ChatGPT and retrieval-augmented Llama2-chat on each of these.
  • Ablations on PopQA show that removing retrieval at test time costs 20.8 percentage points, while removing just the critic costs only 2.9 pp — the retriever is load-bearing; the critique adds calibration on top.
  • At inference time, the weights on critique tokens can be adjusted to trade off citation precision against fluency without any retraining. This makes the model's behavior configurable for different downstream applications.
  • The ICLR 2024 program committee gave Self-RAG oral status (top 1%), which reflects genuine peer recognition of the technical contribution.
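The adjustable-weight idea in the fourth bullet can be sketched as a weighted segment score. The weight names and the linear combination below are a simplification of the paper's decoding objective, assuming normalized probabilities for each critique token's best value:

```python
def segment_score(log_prob_text: float,
                  p_relevant: float,
                  p_supported: float,
                  p_useful: float,
                  w_rel: float = 1.0,
                  w_sup: float = 1.0,
                  w_use: float = 0.5) -> float:
    """Rank candidate segments by generation likelihood plus weighted
    critique-token scores. Raising w_sup rewards fully supported claims
    (citation precision); lowering it leans toward fluency."""
    return (log_prob_text
            + w_rel * p_relevant
            + w_sup * p_supported
            + w_use * p_useful)

strict = segment_score(-1.2, 0.9, 0.8, 0.7, w_sup=2.0)   # support-heavy
lenient = segment_score(-1.2, 0.9, 0.8, 0.7, w_sup=0.5)  # fluency-leaning
# The support-heavy weighting scores this well-supported candidate higher,
# so it wins more beam comparisons -- no retraining required.
```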

What holds up — and what doesn't

The ablation results are convincing. The gap between always-retrieve and no-retrieve is large (20.8 pp); the model clearly learned to discriminate useful retrieval from noise. The IsRel and IsSup tokens add measurable value on top of adaptive retrieval alone. That's a meaningful result, not just a reframing.

What I'm less convinced by is the generalization claim. All five evaluation tasks (PopQA, TriviaQA, PubHealth, ARC-Challenge, ASQA) are short-form or multiple-choice QA — the exact setting where a single retrieved passage can provide a decisive signal. Long-form generation over multi-document contexts, which is where finance tasks live, gets less scrutiny. The Biography FactScore (80.2) is the closest proxy, but biographies are relatively well-structured compared to a sprawling multi-year expense ledger.

There's also a reproducibility catch: the critic model's training labels come from GPT-4. That makes the label quality dependent on a proprietary system and introduces API costs that aren't reported. CRAG (arXiv:2401.15884) later showed that a 0.77B retrieval evaluator — much lighter than Self-RAG's 7B critic — could correct retrieval quality and gain 19.0 pp over standard RAG on PopQA, suggesting the heavy fine-tuned critic may not be necessary. That's a meaningful challenge to the design, even if the core insight about selective retrieval holds.

Finally, the comparison baseline matters. Beating ChatGPT (likely GPT-3.5-turbo, late 2023) and Llama2-chat is a reasonable bar for an open 13B model, but frontier models have moved substantially since then. Whether Self-RAG's adaptive retrieval would beat a well-prompted GPT-4o with a simple retrieve-always setup on these same benchmarks is not addressed.

Why this matters for finance AI

Finance agents over Beancount ledgers face exactly the retrieval-discrimination problem Self-RAG addresses. When a user asks "what's my net income this month?", the agent can compute from its loaded context — retrieval might just add noise. When the same user asks "did I record the Q3 contractor invoice?", the agent needs to scan potentially years of entries. Always-retrieve wastes context and risks injecting irrelevant old transactions; never-retrieve misses the lookup.
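A toy version of that gate looks like this, with a keyword heuristic standing in for the learned Retrieve token. Everything here (function names, cue list, thresholding) is illustrative, not the paper's implementation:

```python
# Cues suggesting the answer needs a scan of historical entries rather
# than the already-loaded context. A Self-RAG-style agent would use the
# model's Retrieve-token probability instead of this keyword heuristic.
LOOKUP_CUES = ("did i record", "find", "when did", "which transaction")

def should_retrieve(query: str) -> bool:
    """Stand-in for P(Retrieve=Yes) on a ledger query."""
    q = query.lower()
    return any(cue in q for cue in LOOKUP_CUES)

def handle(query: str, search) -> str:
    if should_retrieve(query):
        passages = search(query)  # scan historical ledger entries
        return f"answer grounded in {len(passages)} retrieved entries"
    # The gate judged the loaded context sufficient: skip retrieval.
    return "answer computed from loaded context"

# "what's my net income this month?"        -> no retrieval
# "did I record the Q3 contractor invoice?" -> triggers a ledger scan
```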

The IsRel and IsSup tokens map cleanly to ledger validation logic. IsRel: does the retrieved transaction entry actually relate to the query? IsSup: does the retrieved context actually support the generated balance figure, or is the number hallucinated? The utility score (1–5) could inform write-back confidence: only commit a proposed journal entry when the model gives its own reasoning a 4 or 5, and flag the rest for human review.
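That write-back gating rule is easy to sketch. The entry format and the ≥4 threshold below are assumptions about how one might consume a model-reported utility score, not part of Self-RAG itself:

```python
def triage(proposed_entries):
    """Split proposed journal entries by the model's self-reported
    utility score (1-5): commit confident ones, flag the rest for
    human review."""
    commit, review = [], []
    for entry, is_use in proposed_entries:
        (commit if is_use >= 4 else review).append(entry)
    return commit, review

proposals = [
    ('2024-07-01 * "Contractor invoice" Expenses:Consulting 1200 USD', 5),
    ('2024-07-03 * "Ambiguous refund"   Income:Misc         -40 USD', 2),
]
commit, review = triage(proposals)
# commit holds the utility-5 entry; review holds the utility-2 entry
```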

The reproducibility concern matters here too. For a production accounting agent, depending on GPT-4 to generate training labels is an operational constraint. If a lighter evaluator (à la CRAG) can achieve comparable selective retrieval, that's the more deployable path. The Self-RAG design principles — decide before retrieving, critique after retrieving — remain valuable even if the specific token-training recipe is replaced.

Further reading

  • CRAG: Corrective Retrieval Augmented Generation (arXiv:2401.15884) — builds on Self-RAG's adaptive retrieval idea with a lighter evaluator and web-search fallback when local retrieval fails; worth comparing directly with Self-RAG on overlapping benchmarks.
  • RQ-RAG: Learning to Refine Queries for Retrieval Augmented Generation (arXiv:2404.00610) — focuses specifically on query decomposition for complex multi-hop QA, which is the scenario Self-RAG handles least gracefully.
  • FRAMES: Retrieval and Augmentation for Multi-Hop Evaluation (arXiv:2409.12941) — Google DeepMind benchmark for multi-document RAG that requires chaining several retrieved facts; a natural harder test for Self-RAG-style models.