HippoRAG: Neurobiologically Inspired Long-Term Memory for LLMs
HippoRAG, published at NeurIPS 2024, is a retrieval-augmented generation framework that uses a knowledge graph and Personalized PageRank to mimic how the human hippocampus indexes long-term memories. I'm reading it because the core problem it addresses—retrieving information distributed across many documents and connected only through chains of facts—is exactly the problem a Beancount agent faces when answering questions about multi-year ledger histories.
The paper
Jiménez Gutiérrez, Shu, Gu, Yasunaga, and Su identify a structural failure mode in standard RAG: if the passages that answer a question don't share any terms with the query itself, embedding-based retrieval simply won't find them. They call this the path-finding problem—you need to traverse a chain of entities, not just match a query string against a document vector.
Their solution, HippoRAG, mirrors the hippocampal indexing theory of human memory. Offline, an LLM (GPT-3.5-turbo) runs open information extraction (OpenIE) over each passage, producing subject–relation–object triples that build a schemaless knowledge graph of noun-phrase nodes and relational edges. A dense retrieval encoder adds synonymy edges between semantically similar nodes (cosine similarity > 0.8). At query time, the system extracts named entities from the query, seeds a Personalized PageRank (PPR) propagation from those nodes, and ranks passages by aggregating PPR probabilities across their member nodes. A "node specificity" weight—the inverse of the number of passages a node appears in—functions as a graph-native IDF.
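The offline indexing step can be sketched in a few lines. This is a toy illustration, not the paper's code: the triples and embeddings are invented stand-ins (a real pipeline would use LLM-extracted triples and a dense encoder), but the graph construction and the 0.8 synonymy threshold follow the paper's description.

```python
import networkx as nx
import numpy as np

# Toy triples standing in for LLM OpenIE output; names are illustrative.
triples = [
    ("Alan Turing", "born in", "London"),
    ("London", "capital of", "England"),
    ("Alan Turing", "worked at", "Bletchley Park"),
]

# Schemaless knowledge graph: noun-phrase nodes, relational edges.
G = nx.Graph()
for subj, rel, obj in triples:
    G.add_edge(subj, obj, relation=rel)

# Synonymy edges: connect node pairs whose embedding cosine similarity
# exceeds the threshold. Random vectors stand in for a real encoder here.
rng = np.random.default_rng(0)
emb = {n: rng.normal(size=8) for n in G.nodes}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

THRESHOLD = 0.8  # the paper's reported similarity cutoff
nodes = list(G.nodes)
for i, u in enumerate(nodes):
    for v in nodes[i + 1:]:
        if cosine(emb[u], emb[v]) > THRESHOLD:
            G.add_edge(u, v, relation="synonym")
```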
Key ideas
- Graph-native IDF: weighting rare nodes more heavily in PPR propagation is the insight that makes the system work. Without it, high-frequency nodes like "company" would dominate retrieval. Ablations show removing node specificity drops MuSiQue Recall@2 from 40.9 to 37.6.
- Single-step beats iterative: HippoRAG without iteration achieves comparable recall to IRCoT (which runs multiple retrieval rounds interleaved with chain-of-thought reasoning), while being 10–30× cheaper and 6–13× faster at query time.
- Massive gains on 2WikiMultiHopQA: Recall@5 improves from 68.2 (ColBERTv2) to 89.1 (HippoRAG). The gap reflects exactly the path-finding structure of that benchmark's questions.
- Modest gains on MuSiQue: Recall@5 improves only from 49.2 to 51.9. MuSiQue is harder; many questions require reasoning that the graph topology can't fully capture.
- HotpotQA regression: HippoRAG underperforms ColBERTv2 on HotpotQA (Recall@2: 60.5 vs. 64.7). HotpotQA questions are generally solvable from two closely related passages, which plays to embedding retrieval's strengths rather than graph traversal.
- OpenIE quality is the bottleneck: ablations show that using Llama-3-70B for extraction degraded performance due to formatting errors, while Llama-3-8B remained competitive with GPT-3.5-turbo. Off-the-shelf extraction is fragile.
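The query-time scoring these ideas describe can be sketched concretely. Everything below is invented for illustration (passage names, node membership); the structure follows the paper's recipe: seed PPR at query entities weighted by node specificity, then score each passage by summing PPR mass over its member nodes.

```python
import networkx as nx

# Which noun-phrase nodes appear in which passages (toy data).
passage_nodes = {
    "p1": ["Alan Turing", "Bletchley Park"],
    "p2": ["London", "England"],
    "p3": ["Alan Turing", "London"],
}

# Build the graph by linking nodes that co-occur in a passage.
G = nx.Graph()
for nodes in passage_nodes.values():
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            G.add_edge(u, v)

# Node specificity: inverse passage frequency, a graph-native IDF.
count = {}
for nodes in passage_nodes.values():
    for n in set(nodes):
        count[n] = count.get(n, 0) + 1
specificity = {n: 1.0 / c for n, c in count.items()}

def rank_passages(query_entities):
    # Seed PPR at the query entities, weighted by specificity.
    personalization = {n: specificity[n] for n in query_entities if n in G}
    ppr = nx.pagerank(G, alpha=0.85, personalization=personalization)
    # Passage score = sum of PPR probability over its member nodes.
    return sorted(
        ((p, sum(ppr.get(n, 0.0) for n in ns))
         for p, ns in passage_nodes.items()),
        key=lambda kv: kv[1],
        reverse=True,
    )

ranking = rank_passages(["Alan Turing"])
```

Note how p3 outranks p1 even though both contain the seed entity: "London" sits on more graph paths than "Bletchley Park", so more PPR mass flows to it.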
What holds up—and what doesn't
The result is real: on 2WikiMultiHopQA, which is specifically designed around multi-hop chains, graph traversal outperforms dense retrieval by a wide margin. The PPR approach is elegant—seeding propagation at query entities and letting the graph fill in the neighborhood is a principled way to handle distributional mismatch between query and supporting passages.
What I find less convincing is the neurobiological framing. The paper draws an analogy between PageRank and hippocampal CA3 activity, citing a cognitive science study that found correlation between human word recall probabilities and PageRank scores. That's a correlational observation from psycholinguistics, not a derivation. PPR was not designed from hippocampal physiology—calling this "neurobiologically inspired" is branding more than mechanism.
The efficiency claim also deserves scrutiny. Single-step HippoRAG is 10–30× cheaper online than IRCoT—but the offline indexing cost (running GPT-3.5-turbo to extract OpenIE triples from every document) is front-loaded and substantial. For a corpus that changes frequently, this cost is paid again on updates. The paper doesn't report total indexing cost.
Finally, the benchmarks are medium-scale: 6K–11K passages and under 100K graph nodes. The authors explicitly list scalability as an open question. Whether PPR holds up on hundreds of thousands of ledger entries spanning decades is unvalidated.
Why this matters for finance AI
A Beancount ledger is a chain of facts: account hierarchies, transaction references, rule cross-references, budget allocations. A question like "which 2022 expenses fall under the same budget category as invoice #INV-2019-0042?" requires traversing the graph of accounts, transactions, and categories—exactly the path-finding task where standard RAG fails.
HippoRAG's indexing design maps naturally: extract entity-relation triples from ledger entries (account, amount, counterparty, rule), build a graph, then run PPR seeded at query entities. The node specificity weighting would naturally down-weight generic nodes like "expenses" or "assets" and up-weight rare vendor names or account codes, which is precisely what you want.
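A hypothetical version of that triple extraction is easy to sketch. The transaction dicts and field names below are invented for illustration (a real implementation would parse actual Beancount directives), but they show how ledger structure becomes graph edges without needing an LLM at all:

```python
# Invented ledger records; a real pipeline would parse Beancount entries.
transactions = [
    {"id": "txn-001", "account": "Expenses:Software", "payee": "Acme Corp",
     "link": "INV-2019-0042", "budget": "Budget:Tools"},
    {"id": "txn-002", "account": "Expenses:Software", "payee": "Widget LLC",
     "link": None, "budget": "Budget:Tools"},
]

def to_triples(txn):
    # Each transaction becomes a small star of entity-relation triples.
    triples = [
        (txn["id"], "posted_to", txn["account"]),
        (txn["id"], "paid_to", txn["payee"]),
        (txn["id"], "allocated_to", txn["budget"]),
    ]
    if txn["link"]:
        triples.append((txn["id"], "references", txn["link"]))
    return triples

all_triples = [t for txn in transactions for t in to_triples(txn)]
```

Because the ledger's schema is fixed, this sidesteps the paper's OpenIE fragility problem entirely: the triples are deterministic, not LLM-extracted.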
The practical blocker for Beancount is the incremental update cost. Every new transaction adds nodes and edges; re-running OpenIE extraction on new entries is tractable, but PPR complexity scales with graph size. The HippoRAG 2 follow-up (arXiv:2502.14802) claims a 7% further improvement in associative tasks, but the scalability question remains open. For a ledger with millions of transactions, this is the engineering problem that would need to be solved before deploying this approach.
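The update asymmetry is worth making concrete: appending a transaction's triples is a cheap local edit, but PPR still iterates over every edge at query time. A small sketch (toy graph shape and node names are invented):

```python
import networkx as nx

# A chain-shaped toy graph standing in for a grown ledger graph.
G = nx.path_graph(2000)

# Incremental update: cost proportional only to the edges added.
G.add_edge("txn-new", 0)
G.add_edge("txn-new", "Expenses:Software")

# Query-time cost: power iteration touches the whole graph until
# convergence, regardless of how small the last update was.
ppr = nx.pagerank(G, personalization={"txn-new": 1.0})
```

This is why graph size, not update size, is the scaling concern: every query pays for the full corpus.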
What to read next
- GraphRAG (Edge et al., arXiv:2404.16130) — Microsoft's alternative that summarizes graph communities rather than running PPR; better for broad thematic questions, and a useful contrast to HippoRAG's entity-chain approach.
- RAPTOR (Sarthi et al., arXiv:2401.18059) — recursive abstractive tree organization for RAG; HippoRAG beats it on multi-hop benchmarks, but RAPTOR may handle long-range summarization tasks better where graph traversal isn't the right framing.
- IRCoT (Trivedi et al., arXiv:2212.10509) — the iterative retrieval baseline that HippoRAG claims to match at lower cost; worth reading to understand what the 10–30× efficiency claim is actually comparing against.
