StructRAG (ICLR 2025): Picking the Right Document Structure Beats GraphRAG by 28 Points
The running complaint against RAG in production is that retrieval is a blunt instrument when the relevant facts are scattered across dozens of documents in incompatible formats. StructRAG (Li et al., ICLR 2025) takes a direct swing at this by converting retrieved text into a task-appropriate structure — table, graph, catalogue, algorithm, or plain chunk — before reasoning over it. It is motivated by a claim from cognitive theory: that humans naturally reshape raw information into structured representations when tackling complex reasoning tasks. Whether or not that framing is more metaphor than mechanism, the empirical numbers are worth examining carefully.
The paper
StructRAG proposes an inference-time pipeline with three modules. First, a hybrid structure router (Qwen2-7B-Instruct, fine-tuned with DPO on 900 synthetic preference pairs) predicts which of five structure types best fits the incoming question and its documents. Second, a scattered knowledge structurizer (Qwen2-72B-Instruct) rewrites the retrieved chunks into that chosen format. Third, a structured knowledge utilizer decomposes the question into sub-questions, retrieves the relevant structured fragments, and generates the final answer. The five structure types are: table (statistical comparisons), graph (multi-hop chains, encoded as head–relation–tail triples), algorithm (planning tasks, written as pseudo-code), catalogue (summarization, hierarchical numbering), and chunk (simple single-hop, the default RAG fallback).
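To make the data flow concrete, here is a minimal sketch of how the three modules could chain together. Everything below is illustrative: the `llm` helper, prompt strings, and function names are my assumptions rather than the authors' released code; only the five structure types and the three-stage flow come from the paper.

```python
from enum import Enum

class Structure(Enum):
    TABLE = "table"          # statistical comparison questions
    GRAPH = "graph"          # multi-hop chains as (head, relation, tail) triples
    ALGORITHM = "algorithm"  # planning tasks, written as pseudo-code
    CATALOGUE = "catalogue"  # summarization with hierarchical numbering
    CHUNK = "chunk"          # simple single-hop, the plain-RAG fallback

def llm(prompt: str) -> str:
    """Placeholder for a chat-model call (the paper uses Qwen2 models); swap in a real client."""
    raise NotImplementedError

def route_structure(question: str, doc_titles: list[str]) -> Structure:
    """Module 1: hybrid structure router (a DPO-fine-tuned 7B model in the paper)."""
    label = llm(f"Choose one of {[s.value for s in Structure]} for:\n{question}\nDocs: {doc_titles}")
    return Structure(label.strip())

def structurize(chunks: list[str], structure: Structure) -> str:
    """Module 2: scattered knowledge structurizer (a 72B model in the paper)."""
    return llm(f"Rewrite the following passages as a {structure.value}:\n" + "\n\n".join(chunks))

def utilize(question: str, structured: str) -> str:
    """Module 3: decompose the question, answer sub-questions over the structured knowledge, combine."""
    sub_questions = llm(f"Decompose into sub-questions: {question}").splitlines()
    partials = [llm(f"Answer {q!r} using only:\n{structured}") for q in sub_questions]
    return llm(f"Combine into a final answer to {question!r}:\n" + "\n".join(partials))

def struct_rag(question: str, chunks: list[str], doc_titles: list[str]) -> str:
    structure = route_structure(question, doc_titles)
    return utilize(question, structurize(chunks, structure))
```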
The authors evaluate primarily on the Loong benchmark (EMNLP 2024 Oral), a multi-document QA benchmark spanning financial reports, legal cases, and academic papers, with inputs ranging from 10K to 250K tokens, covering four task types: Spotlight Locating, Comparison, Clustering, and Chain of Reasoning.
Key ideas
- The DPO-trained router reaches 94.38% accuracy at structure-type selection versus 50.04% zero-shot with Qwen2-72B-Instruct — the routing decision is the single most critical component. Ablating the router drops overall LLM score from 60.38 to 45.33.
- At the hardest document-length tier (200K–250K tokens), StructRAG scores 51.42 vs. Long-Context at 28.92 and RAG at 29.29 — a ~22-point gap that widens as context grows. The standard "just stuff everything in" approach deteriorates sharply while StructRAG degrades more gracefully.
- GraphRAG, despite also imposing structure, posts an overall LLM score of 40.82 on Loong versus StructRAG's 69.43, and it takes 217.1 minutes per query versus StructRAG's 9.7 minutes. Pre-building a global knowledge graph is both slower and less accurate than picking the right format on demand.
- On Podcast Transcripts (open-ended summarization), StructRAG achieves a 95.75% pairwise win rate over Long-Context, suggesting structured synthesis outperforms full-context approaches even on less structured source material.
- The exact-match (EM) scores consistently lag behind LLM-judged scores because structurization changes surface wording (e.g., "$1,308,463" becomes "1308463" in a table cell), creating a systematic token-mismatch problem that penalizes automated evaluation.
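The last point is easy to reproduce. Here is a small sketch (my own, not the paper's evaluation code) of why exact match punishes structurized output even when the underlying value is preserved, and how a light numeric normalization would recover the match:

```python
import re

def _numeric(s: str) -> str:
    """Keep only digits and decimal points, dropping '$', commas, and spaces."""
    return re.sub(r"[^\d.]", "", s)

def exact_match(pred: str, gold: str) -> bool:
    return pred.strip() == gold.strip()

def normalized_match(pred: str, gold: str) -> bool:
    return _numeric(pred) == _numeric(gold)

gold = "$1,308,463"   # answer as worded in the source document
pred = "1308463"      # the same value after being rewritten into a table cell

print(exact_match(pred, gold))       # False -> EM penalizes the structurized answer
print(normalized_match(pred, gold))  # True  -> the underlying value is identical
```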
What holds up — and what doesn't
The core result is real and the ablation story is clean: routing matters most, followed by structurization, followed by utilization. The improvement at long document lengths is the strongest finding — 22 points at 200K tokens is not noise.
That said, I have three reservations. First, the benchmark coverage is thin. StructRAG reports only Loong and Podcast Transcripts. Standard retrieval QA benchmarks (HotpotQA, 2WikiMultiHopQA, MuSiQue, NQ) are notably absent, which makes it impossible to assess how StructRAG compares against the large body of prior retrieval research on those established splits. Reviewers at ICLR presumably raised this; the paper offers no direct response in the published version.
Second, the evaluation model is GPT-4. LLM-as-judge scoring is susceptible to length bias and stylistic preferences that may favor outputs from the same structurization process, especially when the judge has been trained on similar structured text. The EM metric is a corrective, but the authors frame it as a limitation of the metric rather than evidence of a problem with the method.
Third, StructRAG is tested with a large backbone (Qwen2-72B-Instruct for the structurizer and utilizer). It is unclear how much of the gain comes from routing versus simply calling a powerful model to rewrite and summarize. An ablation against a same-size direct-answer baseline would settle this, but it is not presented.
Why this matters for finance AI
Beancount ledgers are the canonical instance of the "scattered information" problem. A single reconciliation question — "why did my net assets drop in Q3?" — may require reading transaction entries from three accounts, cross-referencing a balance sheet report, and tracing a multi-step correction chain. These map nearly one-to-one onto StructRAG's structure types: tables for balance comparisons, graphs for transaction chains, catalogues for period summaries.
The routing insight is especially applicable. A query-focused Beancount agent should not always dump chunks into context; it should first ask what shape the answer requires. A balance-trend question needs a table. An "explain this reimbursement chain" question needs a graph. A "summarize this year's spending" question needs a catalogue. Wiring this routing decision explicitly — even with a small model — could dramatically reduce the hallucination and number-mangling that plague current ledger QA attempts.
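As a rough sketch of what that explicit routing step could look like for a ledger agent: the keyword heuristics and category names below are my own illustration, not from the paper or Beancount. StructRAG itself uses a fine-tuned DPO router, and a production agent would likely want a small classifier rather than keyword matching.

```python
# Illustrative routing step for a Beancount QA agent: decide the answer's shape
# before retrieving anything. Heuristics below are assumptions for illustration only.
STRUCTURE_HINTS = {
    "table":     ["balance", "trend", "compare", "net assets", "by month"],
    "graph":     ["chain", "trace", "reimbursement", "correction", "transfer"],
    "catalogue": ["summarize", "summary", "overview", "breakdown"],
}

def route_ledger_query(question: str) -> str:
    """Pick the structure a ledger answer should take; fall back to plain chunks."""
    q = question.lower()
    for structure, hints in STRUCTURE_HINTS.items():
        if any(hint in q for hint in hints):
            return structure
    return "chunk"  # default: plain retrieved entries, as in vanilla RAG

print(route_ledger_query("Why did my net assets drop in Q3?"))   # table
print(route_ledger_query("Trace this reimbursement chain"))      # graph
print(route_ledger_query("Summarize this year's spending"))      # catalogue
```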
The 217-to-9.7-minute latency story also matters in practice. For an interactive Beancount agent, GraphRAG's pre-indexing cost is prohibitive for frequently updated ledgers; StructRAG's inference-time approach fits the write-heavy, query-sparse ledger use case better.
The caveat: StructRAG's structurizer is a large LLM call on every query. For long ledger histories, that inference cost could become significant. Token-efficient structurization — perhaps a smaller fine-tuned model — is an open engineering question.
What to read next
- From Local to Global: A Graph RAG Approach to Query-Focused Summarization (Edge et al., 2024, arXiv:2404.16130) — Microsoft GraphRAG uses community summaries for global queries; understanding where StructRAG's inference-time structurization beats GraphRAG's pre-indexing is the key architectural trade-off to nail down.
- FinAuditing: A Financial Taxonomy-Structured Multi-Document Benchmark (arXiv:2510.08886) — tests 13 LLMs on XBRL filings with hierarchical tables; a direct test of whether StructRAG's table and catalogue structures transfer to the structured filing format that Beancount ledgers resemble.
- InvestorBench: A Benchmark for Financial Decision-Making Tasks with LLM-based Agent (arXiv:2412.18174, ACL 2025) — evaluates agents on live financial decisions, which would let us measure whether StructRAG's structured reasoning actually helps downstream decision quality beyond single-hop QA accuracy.
