
DIN-SQL: Decomposed In-Context Learning for Text-to-SQL

· 7 min read
Mike Thrift
Marketing Manager

Last week I covered BIRD, the benchmark that revealed how badly LLMs stumble when text-to-SQL moves from curated toy databases to real-world schemas with messy naming, domain knowledge, and efficiency constraints. DIN-SQL is the paper I should have read first: it defined what a carefully decomposed LLM prompting pipeline can actually achieve on Spider and BIRD, and it arrived at NeurIPS 2023 from Mohammadreza Pourreza and Davood Rafiei just as GPT-4 became widely available. Reading it now — after BIRD exposed the ceiling — makes the strengths and limits much easier to see clearly.

The paper


DIN-SQL (arXiv:2304.11015) addresses a stark performance gap. In early 2023, the best fine-tuned models — RESDSQL-3B+NatSQL — reached 79.9% execution accuracy on the Spider test set, while GPT-4 with naive few-shot prompting only managed 67.4% on the development set. Pourreza and Rafiei's hypothesis is that this gap is mostly an interface problem: LLMs are capable enough, but generating SQL in a single pass asks them to solve schema linking, complexity classification, and query synthesis all at once. Decompose those into sequential sub-tasks and feed each solution forward as context, and the story changes.

Their pipeline has four stages:

  • Schema linking: a chain-of-thought prompt maps the natural language question to specific columns and values in the schema.
  • Classification and decomposition: the query is binned as easy (single table, no joins), non-nested complex (joins but no subqueries), or nested complex (joins, subqueries, set operations); nested queries are decomposed into sub-queries.
  • SQL generation: the prompting strategy is matched to the complexity class: simple few-shot for easy, the NatSQL intermediate representation for non-nested, and multi-step chain-of-thought for nested.
  • Self-correction: the model reviews its own output for minor errors like a missing DISTINCT or a misplaced DESC.
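The control flow is easy to sketch. Below is a minimal, hedged outline of the four sequential stages with each stage's output fed forward as context; `llm` is a stand-in for a model call (GPT-4 in the paper), and all prompt wording is illustrative, not the paper's verbatim prompts.

```python
# Hedged sketch of DIN-SQL's four-stage pipeline. `llm` is any callable
# taking a prompt string and returning the model's text completion.

def din_sql(question: str, schema: str, llm) -> str:
    # Stage 1: schema linking via chain-of-thought.
    links = llm(f"Schema: {schema}\nQuestion: {question}\n"
                "Find the referenced columns and values, step by step.")

    # Stage 2: complexity classification (for NESTED queries the paper also
    # decomposes the question into sub-questions here; elided for brevity).
    label = llm(f"Classify as EASY, NON-NESTED, or NESTED: {question}\n"
                f"Schema links: {links}").strip()

    # Stage 3: generation strategy matched to the complexity class.
    if label == "EASY":
        sql = llm(f"Write SQL (few-shot): {question}\nLinks: {links}")
    elif label == "NON-NESTED":
        # The paper routes through the NatSQL intermediate representation.
        sql = llm(f"Write NatSQL, then SQL: {question}\nLinks: {links}")
    else:
        sql = llm(f"Decompose into sub-queries, then write SQL step by step: "
                  f"{question}\nLinks: {links}")

    # Stage 4: self-correction of minor errors (missing DISTINCT, stray DESC).
    return llm(f"Review this SQL for minor mistakes and return the final query:\n{sql}")
```

In practice each `llm` call carries the paper's few-shot exemplars in the prompt; the sketch only captures the feed-forward structure that the ablations later probe stage by stage.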

Key ideas

  • GPT-4 + DIN-SQL reaches 85.3% execution accuracy on Spider's holdout test set, a +5.4 point jump over the then-SOTA fine-tuned RESDSQL-3B+NatSQL (79.9%), without any task-specific training data.
  • On Spider dev, the decomposition pipeline moves GPT-4 from 67.4% (few-shot baseline) to 74.2% — a clean +6.8 point gain. CodeX Davinci moves from 61.5% to 69.9%.
  • Ablation confirms every stage contributes: removing schema linking alone drops CodeX from 69.9% to 65.9%; removing classification drops it further to 63.1%.
  • The gains are concentrated in easy and medium queries. On "extra hard" queries, even the full DIN-SQL + GPT-4 pipeline only reaches 43.4% on Spider dev — decomposition doesn't solve complexity, it just reduces avoidable errors on tractable queries.
  • Self-correction is model-sensitive: GPT-4 responds to "gentle" prompts that ask for potential improvements; CodeX responds better to "generic" prompts that assume the SQL is wrong. This suggests the module is doing stylistic cleanup, not real semantic verification.
  • On BIRD dev, DIN-SQL + GPT-4 scores 50.72% vs. a plain GPT-4 baseline of 46.35% — a +4.4 point improvement, substantially smaller than the Spider gains, which I'll come back to.
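The model-sensitivity of self-correction is easy to operationalize: keep both prompt styles and choose per model. The wording below is my paraphrase of the two styles the paper describes, not its verbatim prompts.

```python
# Two self-correction prompt styles, paraphrased from the paper's description:
# a "gentle" style that merely invites improvements (reported to work better
# for GPT-4) and a "generic" style that asserts the SQL is wrong (reported to
# work better for CodeX).

GENTLE = ("Review the SQL below for potential minor issues, such as a missing "
          "DISTINCT or a misplaced DESC, and improve it only if needed:\n{sql}")
GENERIC = "The SQL below is buggy. Fix it and return the corrected query:\n{sql}"

def self_correction_prompt(model: str, sql: str) -> str:
    # Model-sensitive choice; the prefix check is a stand-in for real routing.
    template = GENTLE if model.startswith("gpt-4") else GENERIC
    return template.format(sql=sql)
```

That the best prompt flips between models is itself evidence for the stylistic-cleanup reading: a genuine verifier shouldn't care how politely it's asked.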

What holds up — and what doesn't

The core result is real. Decomposing text-to-SQL into explicit sub-tasks improves performance, and the ablation studies are clean enough to believe the individual module contributions. Schema linking matters the most for hard queries, which makes sense: the model can't generate correct JOINs if it hasn't first correctly identified which tables and columns the question refers to.

But several things give me pause. The 74.2% dev vs. 85.3% test discrepancy is suspicious. Scores on a dev set are usually at least as high as on the held-out test set, because prompts and pipelines get implicitly tuned against dev; a +11 point jump from dev to test is unusual. The authors don't explain it, and it makes me wonder whether the test set has a different distribution of query difficulties, or whether something about the holdout evaluation (run via the official Spider leaderboard server) differs from their dev-set evaluation. I wouldn't cite 85.3% without that caveat.

The BIRD improvement (+4.4 points, to 50.72% on dev) is notably smaller than the Spider gains. BIRD databases have messy real-world schemas with abbreviated column names, domain-specific terminology, and ambiguous values. The DIN-SQL schema-linking module was designed with Spider's relatively clean schemas in mind; when the schemas get dirtier, linking accuracy drops and the rest of the pipeline degrades with it. This is exactly what the BIRD paper measured, and DIN-SQL doesn't solve it.

The cost and latency numbers are a problem for any production system: roughly $0.50 and 60 seconds per question with GPT-4. That's fine for a data analyst running ten queries a day but completely unworkable for interactive use. The paper presents this as a known limitation but doesn't propose a path to reducing it. DAIL-SQL (arXiv:2308.15363), which appeared a few months later, would show that better example selection rather than explicit decomposition could reach 86.6% on Spider — surpassing DIN-SQL at meaningfully lower cost.

The self-correction module is the weakest part. The authors acknowledge it catches "minor" errors. What it can't do is detect semantic errors — cases where the generated SQL is syntactically valid and even executes, but answers the wrong question. That's the harder failure mode for real users.

Why this matters for finance AI

Beanquery (BQL) is a SQL-like query language over Beancount ledger data. It has its own table structure — transactions, postings, balance, prices — with account hierarchies, tags, and metadata fields that look nothing like generic database schemas. A natural language interface to BQL is a real and useful thing (there's already an experimental beanquery-mcp server implementing exactly this via MCP), and DIN-SQL's decomposition strategy is the right starting point.

Schema linking over BQL is the analogous problem to schema linking over relational tables, but with two extra wrinkles: account names are hierarchical paths like Assets:US:Checking:Bank, and the relevant schema depends on what kind of query the user is asking (income statement, balance sheet, cash flow). DIN-SQL's classification module suggests a direct adaptation: classify the query intent (balance vs. flow vs. price lookup) first, then route to different prompt templates.
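As a concrete sketch of that adaptation: a first-pass intent router could sit in front of any LLM call, mirroring DIN-SQL's classify-then-route stage. The intent labels, keywords, and prompt templates below are entirely my own illustration; they come from neither DIN-SQL nor beanquery-mcp.

```python
# Hypothetical intent router for a natural-language-to-BQL pipeline. A real
# system would classify with the LLM itself; keyword matching here just makes
# the routing idea concrete and testable.

INTENT_KEYWORDS = {
    "balance": ("balance", "net worth", "how much do i have"),
    "flow":    ("spend", "income", "expense", "cash flow"),
    "price":   ("price", "exchange rate", "quote"),
}

PROMPT_TEMPLATES = {
    "balance": "Write a BQL query over account balances for: {q}",
    "flow":    "Write a BQL query aggregating postings over a period for: {q}",
    "price":   "Write a BQL query against the prices table for: {q}",
}

def route(question: str) -> str:
    q = question.lower()
    # First matching intent wins; default to "flow" (the most common case).
    intent = next((i for i, kws in INTENT_KEYWORDS.items()
                   if any(k in q for k in kws)), "flow")
    return PROMPT_TEMPLATES[intent].format(q=question)
```

Each template would then carry intent-specific few-shot examples and only the slice of the ledger schema that intent needs, which is exactly the context-narrowing trick that makes DIN-SQL's classification stage pay off.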

The BIRD benchmark's finding that real-world messiness hurts LLM text-to-SQL is directly relevant. A Beancount ledger is also "messy" — user-defined account names, inconsistent commodity symbols, custom metadata keys. The 4.4 point BIRD improvement vs. the 6.8 point Spider improvement tells me that the structured, clean-schema regime overestimates how much decomposition will help on actual BQL queries. Expect smaller gains in practice.

The cost constraint is real but less binding here. A personal finance user running 10–20 queries per day can tolerate $5–10/day in API costs if the interface is genuinely useful. The latency (60 seconds) is more of a problem for interactive use; batching analytical queries might sidestep it.

Further reading

  • DAIL-SQL: Text-to-SQL Empowered by Large Language Models: A Benchmark Evaluation (arXiv:2308.15363) — systematic study of prompt engineering strategies; achieves 86.6% on Spider by focusing on example selection rather than architectural decomposition, a useful counterpoint to DIN-SQL
  • RESDSQL: Decoupling Schema Linking and Skeleton Parsing for Text-to-SQL (arXiv:2302.05965, AAAI 2023) — the fine-tuned baseline DIN-SQL beat; understanding what the fine-tuned approach does well clarifies where prompting still falls short
  • MAC-SQL: A Multi-Agent Collaborative Framework for Text-to-SQL (arXiv:2312.11242) — extends the multi-step decomposition idea into an explicit multi-agent pipeline with a Selector, Decomposer, and Refiner; reaches 59.59% on BIRD with GPT-4 and is the most agent-centric approach in the space