Fine-Tuning vs. RAG: Why Retrieval Wins for Injecting New Knowledge into LLMs
The question I keep circling back to when designing Beancount agents is this: when your ledger data changes, should you fine-tune the model on the new facts or build a retrieval system? Ovadia et al.'s "Fine-Tuning or Retrieval? Comparing Knowledge Injection in LLMs" (EMNLP 2024, arXiv:2312.05934) gives the cleanest empirical answer I've found, and it cuts sharply against the fine-tuning hype.
The paper
Oded Ovadia, Menachem Brief, Moshik Mishaeli, and Oren Elisha compare two approaches to updating what an LLM knows: unsupervised continual pre-training (the model reads new text and continues next-token prediction) and RAG (the model receives retrieved passages at query time). They test three 7B-parameter models — Llama2-7B, Mistral-7B, and Orca2-7B — across two knowledge domains: a subset of MMLU covering anatomy, astronomy, college biology, and chemistry (knowledge the models likely saw in pretraining), and a custom current-events dataset of 910 multiple-choice questions about U.S. events from August–November 2023, explicitly beyond the models' training cutoffs. The RAG pipeline uses BGE-large-en embeddings over a FAISS index. Fine-tuning runs unsupervised causal LM training on Wikipedia chunks of 256 tokens on 4 A100 GPUs.
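The retrieval side of this setup is conceptually simple: embed the corpus, rank passages against the query, and prepend the top-K hits to the question. A minimal sketch of that flow, using a toy bag-of-words cosine similarity as a stand-in for the paper's BGE-large-en embeddings and FAISS index (every name here is illustrative, not the authors' code):

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; the paper uses dense BGE-large-en vectors.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    # FAISS does this ranking with an approximate-nearest-neighbor index;
    # a plain sort over similarities is the same idea at toy scale.
    q = embed(query)
    ranked = sorted(corpus, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(query: str, corpus: list[str], k: int = 2) -> str:
    # The model answers with retrieved passages prepended at query time --
    # no weight updates involved.
    context = "\n".join(f"- {p}" for p in retrieve(query, corpus, k))
    return f"Answer using the context below.\nContext:\n{context}\nQuestion: {query}"
```

The point of the sketch is the division of labor: all new knowledge lives in `corpus`, and updating it is an index operation, not a training run.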
Key ideas
- RAG dominates on genuinely new knowledge: On the current-events task, RAG alone scores 0.875 (Mistral) and 0.876 (Orca) against base-model baselines of 0.353–0.481. Unsupervised fine-tuning with paraphrasing reaches only 0.504–0.511 — RAG more than doubles the accuracy gain that fine-tuning achieves on facts beyond the training cutoff.
- Fine-tuning underperforms even on familiar knowledge: On MMLU subjects the models had already encountered during pretraining, fine-tuning yields only modest gains, and RAG still outperforms it across every tested subject.
- Paraphrases help, but slowly: GPT-4-generated paraphrases of each training chunk improve fine-tuning results monotonically — 10 versions consistently beat 1 — and the authors suggest this may partially address the Reversal Curse (Berglund et al., arXiv:2309.12288), where models trained on "A is B" fail to generalize to "B is A". They're careful to note the connection warrants further research.
- Catastrophic forgetting is a real cost: Llama2 without data augmentation showed significant accuracy degradation on previously learned tasks after fine-tuning on current events. RAG sidesteps this entirely.
- Combining both doesn't reliably help: Fine-tuning + RAG reached 0.520–0.830 in the current-events condition, sometimes below RAG alone. Fine-tuning appears to interfere with the model's ability to use retrieved context.
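The paraphrase-augmentation recipe from the third bullet amounts to a data-expansion loop run before fine-tuning. A sketch under assumptions — `paraphrase` stands in for the paper's GPT-4 call and is purely a placeholder:

```python
from typing import Callable, List

def augment_corpus(
    chunks: List[str],
    paraphrase: Callable[[str, int], List[str]],
    n_versions: int = 10,
) -> List[str]:
    """Expand each training chunk with n paraphrased variants.

    The paper fine-tunes on the original 256-token chunks plus
    GPT-4-generated paraphrases, and accuracy rises monotonically
    with the number of versions. `paraphrase(chunk, n)` is a
    hypothetical stand-in for that LLM call.
    """
    augmented = []
    for chunk in chunks:
        augmented.append(chunk)                    # keep the original
        augmented.extend(paraphrase(chunk, n_versions))
    return augmented
```

Note the cost side of the monotonic-improvement curve: at `n_versions=10` the training set grows roughly 11x, in both tokens and paraphrase-generation calls.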
What holds up — and what doesn't
The core finding is credible. A 910-question dataset with a clean temporal cutoff is enough to trust the direction of the result: unsupervised fine-tuning is a poor vehicle for injecting genuinely new facts. The evaluation design is clean and the effect sizes are large.
The blind spots are also real. All three tested models are 7B parameters — we don't know whether the fine-tuning gap shrinks or grows with frontier-scale models. More importantly, the fine-tuning method is strictly unsupervised next-token prediction. No LoRA, no instruction tuning, no supervised QA pairs. RAFT (Zhang et al., arXiv:2403.10131) and similar supervised domain-adaptation approaches are more competitive baselines that this paper doesn't engage with. The conclusion "fine-tuning loses" is really "unsupervised fine-tuning loses," which is a narrower claim.
The RAG implementation is also modest: basic dense retrieval with FAISS and BGE-large-en, no reranking or query expansion. An appendix note acknowledges that optimal K varies substantially across models and tasks — choosing the wrong number of retrieved passages significantly hurts performance. In production, K-tuning per domain is a non-trivial operational cost.
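Operationally, the per-domain K-tuning the appendix warns about is just a small hyperparameter sweep over a held-out QA set. A sketch, where `eval_accuracy` is a hypothetical callback that runs the full RAG pipeline with a given K and reports validation accuracy:

```python
from typing import Callable, Dict, Iterable, Tuple

def tune_k(
    eval_accuracy: Callable[[int], float],
    k_values: Iterable[int] = (1, 2, 3, 5, 10),
) -> Tuple[int, Dict[int, float]]:
    """Pick the number of retrieved passages that maximizes held-out accuracy.

    eval_accuracy(k) is assumed to retrieve k passages per question,
    query the model, and score the answers; each call is a full
    evaluation pass, which is what makes per-domain tuning costly.
    """
    scores = {k: eval_accuracy(k) for k in k_values}
    best_k = max(scores, key=scores.get)
    return best_k, scores
```

Since the paper finds that the optimal K differs across models and tasks, this sweep has to be repeated per deployment rather than tuned once globally.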
One claim I'd push back on: the authors frame the paraphrase-helps-fine-tuning finding as potentially ameliorating the Reversal Curse, but their evidence is indirect. The monotonic improvement with paraphrase count could just reflect standard data augmentation benefits rather than any structural fix to bidirectional generalization. The connection is interesting but not established.
Why this matters for finance AI
This is one of the most directly actionable papers for the Bean Labs agenda. A Beancount agent cannot be retrained every time a transaction is added, a rule changes, or a new fiscal year begins. The paper strongly supports treating the ledger as a retrieval corpus rather than fine-tuning material: the factual gains from fine-tuning are modest, the catastrophic forgetting risk is real, and the operational cost of retraining far exceeds the cost of re-indexing.
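Concretely, "ledger as retrieval corpus" means rendering each entry as a text passage and re-indexing whenever the ledger changes. A hypothetical sketch — the field names and record shape are illustrative, not the beancount library's API:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Txn:
    # Simplified transaction record; a real Beancount entry has more fields.
    date: str
    payee: str
    narration: str
    postings: List[Tuple[str, str, str]]  # (account, amount, currency)

def txn_to_document(txn: Txn) -> str:
    # Render one transaction as a natural-language passage for the
    # retrieval index; a ledger change triggers re-indexing, not retraining.
    lines = [f"{txn.date} {txn.payee}: {txn.narration}"]
    for account, amount, currency in txn.postings:
        lines.append(f"  {account}  {amount} {currency}")
    return "\n".join(lines)
```

With this framing, adding a transaction or opening a new fiscal year is one document insert into the index — exactly the update path the paper's results favor over weight updates.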
The paraphrase finding points to something useful even if we set fine-tuning aside. If a domain-specific accounting rule needs to be embedded deeply in a model's behavior — not just retrieved but followed reliably — expressing it in multiple forms (constraint, validation check, worked example of violation) is likely more robust than a single canonical statement. That's how accounting education works, and it's consistent with how Constitutional AI's rule-following studies frame rule coverage.
The catastrophic forgetting result is the clearest practical warning: unsupervised domain adaptation on ledger data can degrade the general reasoning capabilities needed for anomaly detection and query answering. Retrieval sidesteps this at the cost of an index and a retriever — a trade worth making.
What to read next
- The Reversal Curse (Berglund et al., arXiv:2309.12288, ICLR 2024) — the paper Ovadia et al. invoke; explains why LLMs fail at bidirectional implication from training data and frames the fundamental limits of fine-tuning for factual injection.
- RAFT: Adapting Language Model to Domain Specific RAG (Zhang et al., arXiv:2403.10131) — a supervised fine-tuning recipe designed to work with RAG rather than replace it; a more competitive fine-tuning baseline than the unsupervised approach tested here.
- Fine Tuning vs. Retrieval Augmented Generation for Less Popular Knowledge (arXiv:2403.01432) — extends the comparison to long-tail entity knowledge, where RAG again dominates, and proposes Stimulus RAG as a lightweight alternative.
