BIRD Benchmark: The Real-Database Gap in LLM Text-to-SQL
The BIRD benchmark (NeurIPS 2023 Spotlight) is the paper I keep coming back to whenever someone argues that GPT-4 can "query a database in plain English." It asks a pointed question: can LLMs actually serve as a database interface on real databases, not academic toy schemas? The answer is sobering in ways that map almost directly onto what a natural-language query layer for Beancount ledgers would face.
The paper
"Can LLM Already Serve as A Database Interface? A BIg Bench for Large-Scale Database Grounded Text-to-SQLs" by Jinyang Li and a large team from DAMO Academy, HKU, UIUC, and others introduces BIRD: 12,751 question–SQL pairs over 95 real databases totalling 33.4 GB across 37 professional domains. That scale is the point. Spider and WikiSQL, the two benchmarks that dominated text-to-SQL research before this, use small clean databases with at most a few hundred rows. BIRD uses databases lifted from real institutions — financial records, toxicology reports, government datasets — where values are dirty, column semantics require domain knowledge, and query efficiency actually matters. The paper also introduces two metrics: Execution Accuracy (EX), which checks whether the SQL result matches the gold answer, and the Valid Efficiency Score (VES), which penalises correct-but-slow queries.
Key ideas
- GPT-4 achieves only 54.89% execution accuracy on the test set when provided with curated external knowledge evidence. Without that evidence it falls to 34.88% — a gap of 20 percentage points that reveals how much the model leans on the provided hints rather than on its own world knowledge.
- Human performance sits at 92.96% on the test set, leaving a roughly 38-point gap even after GPT-4 is handed the curated domain evidence for each question.
- External knowledge is provided as a per-question "evidence sentence" (e.g., "account.type = 'OWNER' means the account holder is the primary owner"); see the prompt sketch after this list. Models that cannot retrieve or infer this context on their own are essentially hobbled from the start.
- The financial domain, which is most relevant to Beancount, carries the highest annotation noise rate: a follow-up audit found roughly 49% of financial-domain data points contain some error — spelling mistakes, ambiguous questions, or incorrect gold SQL queries.
- The leaderboard has moved considerably since publication. As of 2026, the leading system (AskData + GPT-4o) reaches 81.95% on the test set, with human performance still at ~92.96%, but the gap closed mainly through elaborate multi-step pipelines, not raw model capability.
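For the evidence point above, here is a rough illustration of how the "with knowledge" setting differs from the "without" setting at the prompt level; the layout is illustrative, not BIRD's exact template.

```python
def build_prompt(schema_ddl, question, evidence=None):
    """Assemble a text-to-SQL prompt; `evidence` is the per-question hint that
    the 'with knowledge' setting supplies and the 'without' setting omits."""
    parts = [
        "-- SQLite schema",
        schema_ddl,
        f"-- Question: {question}",
    ]
    if evidence:
        parts.append(f"-- External knowledge: {evidence}")
    parts.append("-- Write a single SQLite query that answers the question.")
    return "\n".join(parts)

prompt = build_prompt(
    schema_ddl="CREATE TABLE account (account_id INTEGER, type TEXT);",
    question="How many accounts belong to a primary owner?",
    evidence="account.type = 'OWNER' means the account holder is the primary owner",
)
```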
What holds up — and what doesn't
The core contribution holds: Spider-style benchmarks genuinely understated the difficulty of text-to-SQL by using sanitised schemas. BIRD's insistence on real database values and external knowledge reveals failure modes that never show up on clean data, and the 20 pp swing from adding knowledge evidence is a reproducible and important finding.
But the benchmark has a design flaw that its own follow-up work acknowledges. The external knowledge evidence is hand-written, per-query, by annotators with domain expertise. That is not a realistic deployment scenario. A real NL-to-SQL agent does not get a pre-written hint for every question; it must retrieve or infer the relevant domain context itself. The SEED paper (2025) shows that automatically generated evidence can match or exceed hand-written evidence in some settings, which weakens BIRD's implicit assumption that the knowledge bottleneck is the hard part.
The noise audit is more damaging. Twenty-two gold SQL queries in the dataset are outright wrong. When those are corrected, model rankings shift: zero-shot GPT-3.5 outperforms DIN-SQL and MAC-SQL, which are designed to beat GPT-3.5 on the uncorrected benchmark. That is a red flag. A benchmark whose rankings reverse on cleanup is teaching us about annotation artifacts as much as about model capability. The financial domain's 49% noise rate in particular makes domain-specific conclusions unreliable.
There is also a subtler issue with VES. Rewarding query efficiency is a sensible real-world goal, but for a benchmark to train and evaluate on efficiency, you need ground truth about what "efficient" means for a specific database engine and data distribution. VES works here because BIRD controls the execution environment. That condition would not hold for a Beancount agent running beanquery against a user's personal ledger on heterogeneous hardware.
Why this matters for finance AI
Beancount's query language, BQL (exposed via the bean-query CLI and the beanquery library), is syntactically close to SQL: it supports SELECT, WHERE, GROUP BY, and the usual aggregation functions over the ledger's postings table, though it offers no general-purpose JOIN. A natural-language interface that translates user questions into BQL is the most natural onramp for non-technical users, and BIRD's findings directly frame the challenge.
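As a concrete anchor, here is a minimal sketch of executing a generated BQL query from Python, assuming the Beancount v2 query API (beancount.query.query.run_query); in Beancount v3 the equivalent functionality lives in the standalone beanquery package. The ledger path and query text are illustrative.

```python
from beancount import loader
from beancount.query import query  # Beancount v2; v3 moved this into `beanquery`

# Load the user's ledger (path is illustrative).
entries, errors, options_map = loader.load_file("ledger.beancount")

# A BQL query an NL-to-BQL agent might emit for
# "how much did I spend on groceries in 2023?"
bql = """
SELECT account, sum(position) AS total
WHERE account ~ 'Expenses:Food:Groceries' AND year = 2023
GROUP BY account
"""

row_types, rows = query.run_query(entries, options_map, bql)
for row in rows:
    print(row)
```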
The external knowledge problem in BIRD maps cleanly onto Beancount. A user might ask "how much did I spend on medical expenses last year?" and the agent needs to know that the user's medical costs live under Expenses:Health:* or Expenses:Medical, depending on how they organised their accounts. That mapping is personal, not in any training corpus. BIRD's finding that GPT-4 loses 20 points without evidence suggests that any BQL-generation agent needs a retrieval step that learns the user's own account taxonomy — essentially a per-user knowledge base.
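A deliberately crude sketch of that per-user retrieval step: index the accounts actually used in the ledger and surface candidates by token overlap with the user's phrase. In practice an agent would want embeddings or a synonym layer, since "medical" will never literally match "Health"; the function name and scoring are illustrative.

```python
import re
from beancount import loader
from beancount.core import getters

def account_candidates(ledger_path, user_phrase, limit=5):
    """Return accounts whose name components overlap with the user's phrase,
    a crude stand-in for a per-user taxonomy retrieval step."""
    entries, _, _ = loader.load_file(ledger_path)
    tokens = set(re.findall(r"[a-z]+", user_phrase.lower()))
    scored = []
    for account in getters.get_accounts(entries):
        parts = {p.lower() for p in account.split(":")}
        overlap = len(tokens & parts)
        if overlap:
            scored.append((overlap, account))
    scored.sort(reverse=True)
    return [account for _, account in scored[:limit]]

# "medical expenses" surfaces Expenses:Medical directly but ranks
# Expenses:Health:* only on the generic "expenses" token.
print(account_candidates("ledger.beancount", "medical expenses"))
```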
The dirty-data problem also maps directly. Imported bank transactions often have inconsistent merchant names, OCR artifacts, and mixed encodings. BIRD quantifies what this costs in terms of SQL correctness, and the number is large enough to make pre-processing a first-class concern rather than an afterthought.
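A small illustration of treating that clean-up as a first-class step: normalising payee strings at import time so that later queries match on consistent values. The rules below are illustrative, not a complete importer.

```python
import re
import unicodedata

def normalize_payee(raw):
    """Collapse common import artifacts: mixed encodings, terminal/store
    suffixes, and inconsistent whitespace and casing."""
    text = unicodedata.normalize("NFKC", raw)
    text = re.sub(r"\s+", " ", text).strip()
    # Drop trailing store or reference numbers such as "#06902" or "*1234".
    text = re.sub(r"[#*]\s*\d{3,}$", "", text).strip()
    return text.title()

assert normalize_payee("WALGREENS  #06902") == "Walgreens"
assert normalize_payee("walgreens #06902\u00a0") == "Walgreens"
```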
What BIRD does not cover: ledger-specific constructs like balance assertions, pad directives, or multi-currency postings have no equivalent in standard SQL, so any BQL agent will face a layer of complexity that BIRD does not measure. The benchmark is a useful lower bound, not a ceiling.
What to read next
- Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows (arXiv:2502.04306, ICLR 2025 Oral) — extends BIRD to enterprise environments with cloud databases and multi-file workflows; the natural next step for understanding real-world deployment gaps.
- SEED: Enhancing Text-to-SQL Performance and Practical Usability Through Automatic Evidence Generation (arXiv:2506.07423) — directly addresses BIRD's hand-written evidence assumption with an automated pipeline.
- DIN-SQL: Decomposed In-Context Learning of Text-to-SQL with Self-Correction (arXiv:2304.11015, NeurIPS 2023) — one of the top BIRD baselines; shows how decomposing a complex SQL query into sub-problems improves accuracy, a technique directly applicable to multi-step BQL queries over Beancount ledgers.
