
TAPAS: Weakly Supervised Table QA Without SQL, and What It Means for Beancount

· 6 min read
Mike Thrift
Marketing Manager

I've been spending time on the text-to-SQL lineage — BIRD, DIN-SQL, MAC-SQL — but all of those share an assumption I want to question: that the right interface for table QA is generating SQL. TAPAS, published by Herzig et al. at Google Research (ACL 2020), takes the opposite bet. It never generates a query. Instead it just selects cells and optionally applies a scalar aggregation, trained end-to-end from answer denotations alone.

The paper


TAPAS extends BERT to encode tables by adding four table-aware embeddings on top of the standard token, position, and segment embeddings. Column ID and Row ID mark where each token lives in the table grid. A Rank ID encodes the relative numeric order within sortable columns (rank 0 means not comparable, rank i+1 for the i-th smallest value). A Previous Answer indicator flags cells that were selected in the prior conversational turn. Combined with the binary segment embedding that distinguishes question tokens from table tokens, that gives TAPAS its seven-type token representation.
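A minimal sketch of how those per-cell IDs could be assigned when a table is flattened into a sequence. This is my own illustration of the scheme, not the paper's code; the function name and table layout are made up for the example:

```python
def table_embedding_ids(table, numeric_cols):
    """Assign per-cell column, row, and rank IDs in the TAPAS style.

    table: list of rows, each a list of cell strings.
    numeric_cols: indices of columns whose values are comparable.
    Column/row ID 0 is reserved for question tokens, so cells start at 1.
    Rank 0 means "not comparable"; rank i+1 marks the i-th smallest value.
    """
    # Precompute numeric ranks per sortable column
    ranks = {}
    for c in numeric_cols:
        ordered = sorted((float(row[c]), r) for r, row in enumerate(table))
        for i, (_, r) in enumerate(ordered):
            ranks[(r, c)] = i + 1

    ids = []
    for r, row in enumerate(table):
        for c, cell in enumerate(row):
            ids.append({
                "cell": cell,
                "column_id": c + 1,
                "row_id": r + 1,
                "rank_id": ranks.get((r, c), 0),
            })
    return ids

rows = [["Groceries", "120.50"], ["Rent", "900.00"], ["Utilities", "60.25"]]
ids = table_embedding_ids(rows, numeric_cols=[1])
# "120.50" is the 2nd-smallest amount, so its rank ID is 2
```

In the real model each of these IDs indexes a learned embedding table and the vectors are summed with the token embedding; the sketch only shows how the IDs are derived from table position.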

At inference time the model selects a set of cells by thresholding per-cell probabilities, then applies one of four aggregation operators — NONE, COUNT, SUM, or AVERAGE — to produce the final answer. There is no intermediate SQL or logical form. Pre-training runs a standard masked language model objective over 6.2 million English Wikipedia table–text pairs.
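The inference step is simple enough to mimic in a few lines. A sketch, assuming per-cell probabilities and the predicted operator are already in hand (the threshold value here is illustrative, not the paper's tuned hyperparameter):

```python
def tapas_answer(cell_probs, cell_values, op, threshold=0.5):
    """Mimic TAPAS inference: keep cells whose selection probability
    clears the threshold, then apply one of the four operators."""
    selected = [v for p, v in zip(cell_probs, cell_values) if p > threshold]
    if op == "NONE":
        return selected              # the answer is the cells themselves
    if op == "COUNT":
        return len(selected)
    nums = [float(v) for v in selected]
    if op == "SUM":
        return sum(nums)
    if op == "AVERAGE":
        return sum(nums) / len(nums)
    raise ValueError(f"unknown operator: {op}")

# "What is the total?" -> SUM over the high-probability cells
total = tapas_answer([0.9, 0.1, 0.8], ["120.50", "900.00", "60.25"], "SUM")
# -> 180.75
```

Note there is no program in sight: the "query" exists only implicitly in which cells cleared the threshold.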

Key ideas

  • The column/row embeddings are load-bearing. Ablation shows that removing them costs 19.4 accuracy points on SQA, 10.6 on WikiSQL, and 11.6 on WikiTQ — far larger than any other architectural component.
  • Table pre-training matters almost as much. Removing it drops SQA by 12.5 points and WikiTQ by 11.1 points even after fine-tuning.
  • On SQA (conversational table QA), TAPAS raises accuracy from 55.1% to 67.2%, a 12-point jump. The Previous Answer token embedding is the mechanism that makes conversational carry-over work without a separate state tracker.
  • On WikiSQL (single-table, mostly lookup and aggregate), TAPAS reaches 83.6% — essentially matching the 83.9% SOTA semantic parser, with zero SQL generation.
  • Transfer learning from WikiSQL to WikiTQ (multi-step, multi-column reasoning) yields 48.7%, 4.2 points above the state-of-the-art at the time. SQA transfer gives 48.8%.
  • Weak supervision is the key affordability claim: the model is trained on (question, answer) pairs, not (question, SQL, answer) triples, so you can annotate large corpora without SQL expertise.
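How can a scalar answer alone supervise both the cell-selection head and the operator head? By keeping everything soft. A sketch in the paper's spirit (not its exact formulation): compute a differentiable expected result as a probability-weighted mix of the operators applied to soft cell selections, then regress that against the annotated answer.

```python
def expected_result(op_probs, cell_probs, values):
    """Differentiable expected answer for weak supervision.

    op_probs: probabilities over (COUNT, SUM, AVERAGE).
    cell_probs: soft per-cell selection probabilities.
    Every quantity is a smooth function of the model outputs, so a loss
    (e.g. Huber) against the gold scalar can train both heads at once.
    """
    soft_count = sum(cell_probs)
    soft_sum = sum(p * v for p, v in zip(cell_probs, values))
    soft_avg = soft_sum / soft_count if soft_count else 0.0
    return (op_probs[0] * soft_count
            + op_probs[1] * soft_sum
            + op_probs[2] * soft_avg)

est = expected_result(op_probs=[0.1, 0.8, 0.1],
                      cell_probs=[0.9, 0.2, 0.7],
                      values=[10.0, 50.0, 30.0])
```

No SQL or logical form ever appears in the loss; the gradient alone pushes probability mass toward the cells and operator that reproduce the annotated answer.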

What holds up — and what doesn't

The core insight — that many table QA questions can be answered by selecting cells and applying one of four scalar operations — is empirically sound for the benchmarks tested. But the paper's honest error analysis on WikiTQ is telling: 37% of errors are unclassified by the authors themselves, 16% require string manipulation the model can't do, and 10% involve complex temporal reasoning. That means TAPAS's ceiling is not about the aggregation operators being wrong; it is about whole categories of question being structurally out of scope.

The 512-token constraint is a hard wall. Tables with more than roughly 500 cells must be truncated, and the model has no mechanism for multi-table reasoning. This is not a tuning problem — it is an architectural one. The model also cannot nest aggregations: a question like "how many accounts have an average balance greater than zero?" requires two passes (average inside a COUNT predicate), which the four-operator head cannot express.
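To make the nesting limitation concrete, here is that balance question answered the only way it can be: in two passes, with toy data I made up for the example. TAPAS's head would have to emit the inner AVERAGEs and the outer COUNT in a single shot, which its four-operator vocabulary cannot express.

```python
from collections import defaultdict

# Each row: (account, balance). The question "how many accounts have an
# average balance greater than zero?" needs an AVERAGE inside a COUNT.
rows = [("Assets:Checking", 120.0), ("Assets:Checking", -20.0),
        ("Liabilities:Card", -300.0), ("Liabilities:Card", -50.0),
        ("Assets:Savings", 500.0)]

# Pass 1: AVERAGE per account group
balances = defaultdict(list)
for account, balance in rows:
    balances[account].append(balance)
averages = {a: sum(b) / len(b) for a, b in balances.items()}

# Pass 2: COUNT the accounts whose average clears the predicate
answer = sum(1 for avg in averages.values() if avg > 0)
# -> 2 (Checking averages +50, Savings +500; Card averages -175)
```

A SQL generator handles this trivially with a subquery or HAVING clause, which is exactly the compositional depth the cell-selection head gives up.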

TAPEX (ICLR 2022) directly addresses the pre-training bottleneck by replacing Wikipedia table MLM with synthetic SQL execution on auto-generated programs, pushing WikiTQ to 57.5% (+4.8) and SQA to 74.5% (+3.5). That is a meaningful gap. But TAPEX inherits the same architectural limits on table size and compositional depth.

The deeper unresolved question neither paper addresses is whether the cell-selection paradigm is a better fit for real-world table QA than SQL generation on practical grounds — not benchmark accuracy, but auditability and correctness guarantees. Selecting cells is opaque: you get an answer but no program. SQL generation is verbose but verifiable. For production use, that tradeoff matters more than a few accuracy points.

Why this matters for finance AI

A Beancount ledger is effectively a flat, structured table: postings in rows, with account, amount, date, currency, and tags as columns. TAPAS's direct cell-selection paradigm maps naturally onto the most common ledger queries — "what is the total spent on groceries in March?" — which are exactly SUM and COUNT aggregations over filtered rows. The Previous Answer embedding is directly useful for conversational sessions where a user refines a query ("and what about last year?").

But Beancount ledgers at scale break TAPAS's constraints. A multi-year ledger with thousands of transactions exceeds the 512-token budget by orders of magnitude. Account hierarchies require reasoning across row groups. Queries like "which accounts have a net outflow greater than their average over the last three years?" need nested aggregations that the four-operator head cannot express. And critically: for write-back safety, cell selection gives no auditable program to check before committing a change. SQL at least gives an inspectable artifact.
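A back-of-envelope check of that budget claim. The tokens-per-row figure below is my own rough assumption (date, account, amount, currency, and payee fragments each cost wordpieces), not a measured number, but the conclusion is not sensitive to it:

```python
# Rough capacity estimate for TAPAS's 512-token input window.
TOKENS_PER_ROW = 6      # assumed wordpieces per flattened ledger posting
BUDGET = 512            # BERT-style sequence length limit

def max_rows(question_tokens=20, special_tokens=2):
    """How many ledger rows fit alongside the question and [CLS]/[SEP]."""
    return (BUDGET - question_tokens - special_tokens) // TOKENS_PER_ROW

rows_that_fit = max_rows()   # on the order of 80 rows
yearly_postings = 5_000      # a modest personal ledger
```

Even under these generous assumptions the window holds a few dozen postings, while a single year of a modest ledger runs to thousands — the gap the truncation has to swallow.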

My tentative conclusion is that the cell-selection paradigm is the right interface for a natural-language read-only query layer over small ledger snapshots — a month's transactions, a single account's history. For full-ledger reasoning and anything involving write-back, a program-synthesis approach (whether SQL-style or Beancount DSL) remains safer and more expressive.

  • TAPEX: Table Pre-training via Learning a Neural SQL Executor (arXiv:2107.07653, ICLR 2022) — the direct successor that replaces Wikipedia MLM with synthetic SQL execution; directly answers whether pre-training on programs beats pre-training on text for table QA
  • Binder: Binding Language Models in Symbolic Languages (arXiv:2210.02875) — uses GPT-3 to generate programs in SQL or Python over tables and achieves SOTA on WikiTQ; the hybrid approach that cell-selection advocates need to reckon with
  • OmniTab: Natural and Artificially Structured Data for Table QA (arXiv:2207.02270) — combines natural table corpora with synthetic SQL data in a single pre-training recipe; tests whether TAPAS and TAPEX are complementary rather than competing