
GraphRAG: From Local to Global Query-Focused Summarization

· 6 min read
Mike Thrift
Marketing Manager

Microsoft's GraphRAG paper landed in April 2024 and quickly became the go-to reference for anyone asking whether knowledge graphs could rescue RAG from its most obvious failure mode: questions that require synthesizing an entire corpus rather than retrieving a specific passage. I'm reading it now because the prior log on FinAuditing exposed how LLMs struggle with multi-document XBRL structures — and GraphRAG's community-summary approach is the most prominent existing answer to exactly that kind of global reasoning problem.

The paper


"From Local to Global: A Graph RAG Approach to Query-Focused Summarization," by Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, Dasha Metropolitansky, Robert Osazuwa Ness, and Jonathan Larson (Microsoft, arXiv:2404.16130), proposes a two-stage LLM-driven pipeline for answering what the authors call "global sensemaking questions" — queries like "What are the main themes in this dataset?" that standard vector RAG cannot answer because no single passage contains the answer.

The approach proceeds in two phases. During indexing, an LLM extracts entities, relationships, and claims from every text chunk, assembles them into a weighted entity graph, and then runs Leiden community detection to partition the graph into a hierarchy of related clusters, generating a natural-language summary for each community at every level. At query time, each community summary independently generates a partial answer (the map step), these partial answers are ranked by helpfulness score and assembled up to the context window limit (the reduce step), and the result is a final synthesized response.
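
The query-time half of this pipeline can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `llm` is a hypothetical callable standing in for a real model call, and the token estimate is deliberately crude.

```python
# Sketch of GraphRAG's query-time map-reduce over precomputed community
# summaries. `llm` is a hypothetical stand-in for a real model call.

def map_step(question, community_summaries, llm):
    """Generate one partial answer per community summary, with a
    helpfulness score (GraphRAG asks the LLM to self-rate each answer)."""
    partials = []
    for summary in community_summaries:
        answer, score = llm(question, summary)  # hypothetical interface
        if score > 0:  # GraphRAG discards unhelpful (score-0) partials
            partials.append((score, answer))
    return partials

def reduce_step(question, partials, llm, token_budget=8000):
    """Pack the highest-scoring partial answers into the context window,
    then ask the LLM for one synthesized global answer."""
    partials.sort(key=lambda p: p[0], reverse=True)
    context, used = [], 0
    for score, answer in partials:
        cost = len(answer.split())  # crude token estimate
        if used + cost > token_budget:
            break
        context.append(answer)
        used += cost
    return llm(question, "\n\n".join(context))
```

The reduce step's ranking is why precomputed summaries scale: only the packing and one final synthesis call happen per query.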

Key ideas

  • Leiden hierarchical community detection structures the corpus into four coarseness levels (C0–C3), letting users trade response depth for token cost — root-level summaries required 97% fewer tokens than processing source text directly.
  • On two test corpora — podcast transcripts (~1M tokens, 8,564 entities, 20,691 relationship edges) and news articles (~1.7M tokens, 15,754 entities, 19,520 edges) — GraphRAG achieved 72–83% comprehensiveness win rates and 62–82% diversity win rates against vector RAG in LLM-judged pairwise comparisons.
  • The map-reduce design avoids long-context LLM calls at query time: community summaries are precomputed, so retrieval becomes fetching a summary rather than re-processing raw documents.
  • The paper benchmarks six conditions: four GraphRAG hierarchy levels, text summarization (TS), and semantic search (SS). Global GraphRAG conditions consistently outperform SS on sensemaking questions; SS performs better on specific lookup queries.
  • Claim extraction experiments found global conditions extracted 31–34 average claims per response versus 25–26 for vector RAG, suggesting broader topical coverage independent of the LLM judge's scoring preferences.
  • The pipeline requires no domain-specific schema or ontology — entity extraction, relationship labeling, and community summarization all come from prompted inference alone.
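
Because the pipeline is schema-free, the underlying graph is simple: weighted edges between entities that co-occur in extracted relationships. A stdlib-only sketch, where `chunk_entities` is assumed to be the output of the (LLM-driven) extraction step:

```python
from collections import Counter
from itertools import combinations

def build_entity_graph(chunk_entities):
    """Aggregate per-chunk entity lists into a weighted undirected graph.
    Edge weight = number of chunks in which both entities appear, which
    approximates GraphRAG's relationship-count edge weighting."""
    edges = Counter()
    for entities in chunk_entities:
        for a, b in combinations(sorted(set(entities)), 2):
            edges[(a, b)] += 1
    return edges
```

The resulting weighted edge list is what Leiden community detection would then partition into the C0–C3 hierarchy.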

What holds up — and what doesn't

The core architectural insight is correct: cosine-similarity RAG cannot answer corpus-level questions because there is no single chunk that represents the whole. GraphRAG's precomputed community summaries are a principled workaround, and the Leiden-based hierarchy is a real design choice that lets you navigate from coarse global summaries to fine-grained cluster summaries depending on cost tolerance.

But the evaluation has serious problems. A recent independent study (arXiv:2506.06331) audited the LLM-as-judge methodology used by GraphRAG and its successors and found three systematic biases: position bias (win rates shift by over 30% simply by swapping which answer appears first in the prompt), length bias (a 25-token difference in a 200-token answer creates a 50-point swing in win rate), and trial bias (identical evaluations produce contradictory outcomes across runs). After correcting for these, the claimed performance advantages collapse — LightRAG's reported 66.7% win rate over naïve RAG corrects to 39.06%. GraphRAG's own 72–83% comprehensiveness numbers were produced with the same methodology and almost certainly inherit the same biases.
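
A cheap partial mitigation for the position bias described above is counterbalancing: judge each pair in both orders and count a win only when the two verdicts agree. A sketch, with `judge` as a hypothetical callable:

```python
def counterbalanced_winrate(pairs, judge):
    """pairs: list of (answer_a, answer_b). judge(x, y) returns "first",
    "second", or "tie" for whichever answer appears first in its prompt.
    A counts a win only if it wins from both positions; disagreement
    between the two orderings is scored as a tie."""
    wins_a = ties = 0
    for a, b in pairs:
        v1, v2 = judge(a, b), judge(b, a)
        if v1 == "first" and v2 == "second":
            wins_a += 1
        elif v1 == "second" and v2 == "first":
            pass  # clean win for B
        else:
            ties += 1
    return wins_a, ties
```

A purely position-biased judge (one that always prefers whichever answer comes first) produces zero wins under this scheme, which is exactly the failure mode the audit paper measures.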

The indexing cost is also a genuine obstacle. One practitioner analysis cited index construction costs reaching $47.9 with GPT-4o as the extraction model for moderate-sized corpora. Microsoft's own LazyGraphRAG variant, released as a follow-up, reduces this to 0.1% of full GraphRAG's indexing cost by deferring graph extraction to query time — an implicit acknowledgment that the original indexing budget is impractical for many real deployments.

The two evaluation corpora are also narrow: two English-language datasets of roughly 1M and 1.7M tokens. The authors acknowledge that generalization to other domains and scales is unknown. For structured or semi-structured data — financial filings, ledger exports — entity extraction prompts optimized for narrative text may miss the tabular and hierarchical relationships that matter most in practice.

Why this matters for finance AI

A Beancount ledger is exactly the corpus where global sensemaking queries arise naturally: "What are my largest spending categories over the past three years?" or "Which vendor accounts have grown faster than 20% year-over-year?" Standard RAG cannot answer these because no single entry contains the answer — the agent needs to synthesize across thousands of transactions.

GraphRAG's community-summary approach maps onto this: if the knowledge graph nodes are accounts, payees, and transaction categories, and the edges are co-occurrence or parent-account relationships, then community summaries become precomputed aggregated views over the ledger. The hierarchy also mirrors how Beancount's account tree already structures data — Assets, Expenses, and Income decompose recursively, which is a natural fit for Leiden-style hierarchical clustering.
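
The account-tree parallel can be made concrete: Beancount account names already encode a hierarchy via colons, so parent-account edges fall out of string splitting — no community detection required for that layer. A stdlib sketch with illustrative account names:

```python
def account_tree_edges(accounts):
    """Derive parent->child edges from colon-delimited Beancount account
    names, e.g. "Expenses:Food:Coffee" yields Expenses -> Expenses:Food
    and Expenses:Food -> Expenses:Food:Coffee."""
    edges = set()
    for name in accounts:
        parts = name.split(":")
        for i in range(1, len(parts)):
            parent = ":".join(parts[:i])
            child = ":".join(parts[:i + 1])
            edges.add((parent, child))
    return edges
```

In a GraphRAG-style index over a ledger, these deterministic edges would coexist with the fuzzier LLM-extracted ones (payee co-occurrence, memo-derived relationships).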

That said, the evaluation bias findings are a warning: the impressive win rates in the paper may not hold under rigorous controlled testing, and the indexing cost makes this a heavier engineering bet than it appears. For Beancount specifically, structured aggregation — SQL-style queries or pandas over the exported ledger — may outperform LLM-driven community summarization for deterministic analytics. GraphRAG's value would be highest for narrative-heavy questions, like reasoning over transaction memos and vendor names at scale, where there is genuine ambiguity that structured queries cannot resolve.
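
For the deterministic half of that split, the structured-aggregation baseline is trivial to state. A sketch using stdlib sqlite3 over a hypothetical flat export of ledger postings (a real Beancount export, e.g. via bean-query, would carry more columns):

```python
import sqlite3

# Hypothetical flat export of ledger postings into an in-memory table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE postings (account TEXT, amount REAL, year INTEGER)")
conn.executemany(
    "INSERT INTO postings VALUES (?, ?, ?)",
    [
        ("Expenses:Food", 120.0, 2023),
        ("Expenses:Food", 80.0, 2024),
        ("Expenses:Rent", 900.0, 2024),
    ],
)

# "Largest spending categories" is a GROUP BY, not a sensemaking query.
rows = conn.execute(
    """SELECT account, SUM(amount) AS total
       FROM postings GROUP BY account ORDER BY total DESC"""
).fetchall()
```

If a question reduces to a query like this, LLM-driven community summarization adds cost and nondeterminism without adding answer quality.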

Further reading

  • LazyGraphRAG (Microsoft Research blog, 2024) — Microsoft's cost-reduced variant that defers graph extraction; directly relevant to whether GraphRAG's approach is deployable at real ledger scale without prohibitive indexing costs
  • "How Significant Are the Real Performance Gains? An Unbiased Evaluation Framework for GraphRAG" (arXiv:2506.06331) — the systematic bias audit; essential reading before accepting any win-rate number from LLM-as-judge evaluations of summarization methods
  • "Towards Verifiably Safe Tool Use for LLM Agents" (arXiv:2601.08012, ICSE 2026) — the next item on the reading list; shifts from summarization to write-back safety, which is the more pressing unsolved problem for Beancount agents