MAC-SQL: Multi-Agent Collaborative Text-to-SQL

June 8, 2026 · 6 min read

Mike Thrift

Marketing Manager

MAC-SQL arrived in December 2023 as the most explicitly agent-centric answer to the text-to-SQL problem: instead of one prompt generating one query, three specialized agents collaborate to select a relevant sub-schema, decompose the question, and repair the SQL after execution. I'm reading it because the two previous entries covered BIRD (the benchmark MAC-SQL topped at submission) and DIN-SQL (the decomposition baseline MAC-SQL extends), and the natural question is whether a multi-agent wrapper buys anything concrete on top of those foundations.

The paper

"MAC-SQL: A Multi-Agent Collaborative Framework for Text-to-SQL" (Wang et al., COLING 2025) targets a failure mode that BIRD exposed in earlier single-prompt methods: large databases with noisy schemas and complex multi-step questions overwhelm models that try to reason about everything in one shot.

The architecture has three agents. A Selector narrows a large database down to a relevant sub-schema by filtering out irrelevant tables and columns before any SQL generation begins. A Decomposer is the core engine — it breaks complex natural-language questions into sub-problems and generates SQL incrementally with few-shot chain-of-thought reasoning. A Refiner executes the candidate SQL against the real database, reads any error messages verbatim, and iteratively corrects the query up to a maximum retry limit. Not all three agents activate on every query; simpler tasks skip the Selector or the Refiner based on complexity signals.
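The control flow described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: in the paper each agent is an LLM prompt, whereas here the agents are plain callables, and the class name, the `selector_min_columns` threshold, and the `needs_refine` predicate are all hypothetical stand-ins for the paper's complexity signals.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Schema = Dict[str, List[str]]  # table name -> column names


@dataclass
class MacSqlStylePipeline:
    # All three agents are hypothetical callables standing in for LLM prompts.
    select: Callable[[str, Schema], Schema]       # (question, schema) -> sub-schema
    decompose: Callable[[str, Schema], str]       # (question, sub-schema) -> SQL
    refine: Callable[[str, str, Schema], str]     # (question, SQL, sub-schema) -> SQL
    selector_min_columns: int = 30                # assumed threshold, not from the paper
    needs_refine: Callable[[str], bool] = lambda sql: True  # assumed complexity signal

    def run(self, question: str, schema: Schema) -> str:
        # Conditional dispatch: small schemas skip the Selector entirely.
        n_columns = sum(len(cols) for cols in schema.values())
        if n_columns >= self.selector_min_columns:
            schema = self.select(question, schema)
        sql = self.decompose(question, schema)
        # The Refiner only runs when the signal says the query is worth checking.
        if self.needs_refine(sql):
            sql = self.refine(question, sql, schema)
        return sql


# Toy agents that only record which stage ran.
calls: List[str] = []

def toy_select(question: str, schema: Schema) -> Schema:
    calls.append("select")
    return {"orders": schema["orders"]}

def toy_decompose(question: str, schema: Schema) -> str:
    calls.append("decompose")
    return "SELECT COUNT(*) FROM orders"

def toy_refine(question: str, sql: str, schema: Schema) -> str:
    calls.append("refine")
    return sql

pipe = MacSqlStylePipeline(toy_select, toy_decompose, toy_refine, selector_min_columns=4)
small = {"orders": ["id", "amount"]}
large = {"orders": ["id", "amount"], "customers": ["id", "name", "email"]}

pipe.run("How many orders?", small)
small_calls = list(calls)
calls.clear()
pipe.run("How many orders?", large)
large_calls = list(calls)
print(small_calls, large_calls)
```

With the toy threshold of four columns, the two-column schema goes straight to the Decomposer while the five-column schema passes through the Selector first.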

The authors also fine-tune SQL-Llama (Code Llama 7B) on outputs produced by the framework, providing a smaller open-source variant.

Key ideas

Schema reduction before generation: The Selector filters the database to a relevant sub-schema before the Decomposer writes SQL. In the ablation, removing it costs 2.11 percentage points on BIRD dev: real, but modest.
Execution-guided refinement: The Refiner reads actual database error messages and corrects the SQL. This is the largest single contributor in ablation: removing it drops BIRD dev accuracy by 4.63 points, more than removing the Selector (−2.11) or even the Decomposer (−3.85).
Conditional agent dispatch: Routing simpler queries past the Selector and Refiner saves tokens without hurting accuracy on easy cases.
Open-source distillation gap: SQL-Llama (7B) reaches 43.94% on BIRD dev versus 46.35% for the GPT-4 baseline, a gap of 2.41 points that is not dramatic given the parameter count difference. The fine-tuned 7B model still trails the full GPT-4+MAC-SQL system's 59.59% test score by more than 15 points, though that comparison crosses the dev and test splits.
BIRD test result: 59.59% execution accuracy, topping the leaderboard at submission time and outperforming DAIL-SQL+GPT-4 (57.41%) by 2.18 points.
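The gaps quoted in this list are easy to recompute. The snippet below uses only the figures stated above; the variable names are mine, and the last gap mixes a dev score with a test score, so it should be read as indicative rather than as a like-for-like comparison.

```python
# Execution accuracy figures (%) exactly as quoted in this post.
mac_sql_gpt4_test = 59.59
dail_sql_gpt4_test = 57.41
sql_llama_7b_dev = 43.94
gpt4_baseline_dev = 46.35

# Accuracy lost on BIRD dev when each agent is removed.
ablation_drop_dev = {"Selector": 2.11, "Decomposer": 3.85, "Refiner": 4.63}

lead_over_dail_sql = round(mac_sql_gpt4_test - dail_sql_gpt4_test, 2)
gap_to_gpt4_baseline = round(gpt4_baseline_dev - sql_llama_7b_dev, 2)
# Cross-split comparison: test score minus dev score.
gap_to_full_system = round(mac_sql_gpt4_test - sql_llama_7b_dev, 2)

# Rank agents from largest to smallest ablation impact.
agents_by_impact = sorted(ablation_drop_dev, key=ablation_drop_dev.get, reverse=True)

print(lead_over_dail_sql, gap_to_gpt4_baseline, gap_to_full_system)
print(agents_by_impact)
```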

What holds up — and what doesn't

The Refiner is the best idea here, and the ablation shows it. An agent that reads a real database error message and corrects its own SQL is doing something genuinely more principled than an LLM second-guessing itself in a vacuum — this is the CRITIC "tool-interactive critiquing" principle applied directly and concretely to SQL execution feedback.
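A minimal sketch of that execution-feedback loop, using Python's built-in `sqlite3` module as the database. The `refine` function, the `toy_fix` repair function (a stand-in for the LLM call), and the `max_retries` default are assumptions for illustration and are not taken from the paper's code.

```python
import sqlite3
from typing import Callable, List, Tuple


def refine(
    conn: sqlite3.Connection,
    sql: str,
    fix: Callable[[str, str], str],
    max_retries: int = 3,  # assumed limit; the paper only says a maximum exists
) -> Tuple[str, List[tuple]]:
    """Run `sql`; on a database error, pass the verbatim message to `fix` and retry."""
    for attempt in range(max_retries + 1):
        try:
            return sql, conn.execute(sql).fetchall()
        except sqlite3.Error as exc:
            if attempt == max_retries:
                raise  # give up and surface the last database error
            sql = fix(sql, str(exc))
    raise RuntimeError("unreachable")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 5.5)])


def toy_fix(sql: str, error: str) -> str:
    # Hypothetical stand-in for the LLM: repairs one known typo using the error text.
    if "no such column: amout" in error:
        return sql.replace("amout", "amount")
    return sql


final_sql, rows = refine(conn, "SELECT SUM(amout) FROM orders", toy_fix)
print(final_sql, rows)
```

The point of the design is that `fix` receives the database's own error string rather than a model's guess about what went wrong.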

The Selector's contribution is positive but thin. For databases with hundreds of tables it probably matters more; for BIRD's typical schema it's marginal, and the paper doesn't report how often the Selector fires or its precision at keeping relevant columns — it's a black box with a single aggregate number.
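One way to open that black box would be to score the Selector's kept columns against the columns the reference SQL actually uses. The sketch below is my own suggestion rather than a metric from the paper; the function name and the idea of deriving the gold set from the reference query are assumptions.

```python
from typing import Set, Tuple

Column = Tuple[str, str]  # (table, column)


def selector_precision_recall(kept: Set[Column], gold: Set[Column]) -> Tuple[float, float]:
    """Precision and recall of the columns a selector kept, against gold columns."""
    true_positives = len(kept & gold)
    precision = true_positives / len(kept) if kept else 0.0
    # With no gold columns there is nothing to miss, so recall is trivially 1.0.
    recall = true_positives / len(gold) if gold else 1.0
    return precision, recall


# Hypothetical example: the selector kept three columns, the gold query needs two.
kept = {("orders", "id"), ("orders", "amount"), ("customers", "name")}
gold = {("orders", "amount"), ("orders", "date")}

precision, recall = selector_precision_recall(kept, gold)
print(precision, recall)
```

Recall is the number to watch here: a dropped gold column makes the correct query unwritable, while a spurious kept column only costs tokens.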

The Decomposer is incremental over DIN-SQL. DIN-SQL already decomposed queries into sub-problems with self-correction; MAC-SQL repackages this as a multi-agent conversation. The architectural split into three named agents is closer to a software design choice than a new inference algorithm. The paper never tests whether the three-agent prompt structure outperforms a single agent with a longer prompt at a matched token budget. The acknowledged limitations (prompts "not extensively engineered" and fine-tuning capped at 7B) are real, but that missing prompt-length-versus-architecture ablation is the more substantive gap.

The temporal context matters for calibration. MAC-SQL's 59.59% on BIRD test was state of the art in December 2023. By mid-2025 the BIRD leaderboard shows systems pushing past 81%. The specific ideas — sub-schema filtering, question decomposition, execution retry — have been absorbed and extended by subsequent work using reasoning-first training, RLVR, and richer CoT. MAC-SQL as an artefact looks dated; MAC-SQL as an architectural pattern remains current.

Why this matters for finance AI

Beancount uses beanquery — a SQL-adjacent query language — as its primary programmatic interface over ledger data. A real multi-year beancount file has a schema that includes dozens of accounts organized in a hierarchy, multiple currencies, metadata tags, and computed balance columns. That is precisely the large, noisy-schema problem the Selector targets.

The Decomposer directly applies to the kinds of queries users actually ask: "What was my total dining expenditure in EUR in Q3 2024 excluding reimbursed transactions, broken down by month?" is a decomposition problem — filter by account prefix, filter by date range, exclude flagged transactions, aggregate per month. The Refiner translates naturally too: before committing a generated beancount entry, an agent could dry-run it through the beancount parser, receive syntax or balance errors, and revise. The execution-feedback loop MAC-SQL demonstrates is the same loop a write-back safety layer needs.
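To make the decomposition concrete, here is the dining question assembled step by step. This is a stand-in, not beanquery: it runs on SQLite, and the `postings` table, its columns, and the `reimbursed` flag are hypothetical choices about how ledger data might be laid out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE postings "
    "(date TEXT, account TEXT, amount REAL, currency TEXT, reimbursed INTEGER)"
)
conn.executemany(
    "INSERT INTO postings VALUES (?, ?, ?, ?, ?)",
    [
        ("2024-07-03", "Expenses:Dining:Restaurants", 40.0, "EUR", 0),
        ("2024-07-20", "Expenses:Dining:Cafe", 10.0, "EUR", 0),
        ("2024-08-11", "Expenses:Dining:Restaurants", 25.0, "EUR", 1),  # reimbursed
        ("2024-08-15", "Expenses:Dining:Restaurants", 30.0, "USD", 0),  # wrong currency
        ("2024-09-02", "Expenses:Dining:Cafe", 12.5, "EUR", 0),
        ("2024-09-05", "Expenses:Groceries", 60.0, "EUR", 0),           # other account
        ("2024-10-01", "Expenses:Dining:Cafe", 8.0, "EUR", 0),          # outside Q3
    ],
)

# Each sub-problem from the question becomes one filter clause.
steps = [
    ("dining accounts", "account LIKE 'Expenses:Dining%'"),
    ("EUR only", "currency = 'EUR'"),
    ("Q3 2024", "date >= '2024-07-01' AND date < '2024-10-01'"),
    ("exclude reimbursed", "reimbursed = 0"),
]
where = " AND ".join(clause for _, clause in steps)

# The final sub-problem is the per-month aggregation.
query = (
    "SELECT strftime('%Y-%m', date) AS month, SUM(amount) AS total "
    f"FROM postings WHERE {where} GROUP BY month ORDER BY month"
)
monthly_totals = conn.execute(query).fetchall()
print(monthly_totals)
```

August drops out of the result because its only EUR dining posting is flagged as reimbursed, which is the kind of silent omission a decomposed query makes easy to trace back to a single clause.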

The open-source distillation result is a caution: fine-tuning a 7B model to approximate a GPT-4-based pipeline yields a model that is still far behind. If Bean Labs builds a local model for ledger query generation, the gap from MAC-SQL suggests that small models need domain-specific training data far beyond what a general-purpose fine-tune provides.
