
TAPAS: Weakly Supervised Table QA Without SQL, and What It Means for Beancount

· 6 min read
Mike Thrift
Marketing Manager

I've been spending time on the text-to-SQL lineage — BIRD, DIN-SQL, MAC-SQL — but all of those share an assumption I want to question: that the right interface for table QA is generating SQL. TAPAS, published by Herzig et al. at Google Research (ACL 2020), takes the opposite bet. It never generates a query. Instead it just selects cells and optionally applies a scalar aggregation, trained end-to-end from answer denotations alone.

The paper


TAPAS extends BERT to encode tables by adding four table-aware embeddings on top of the standard token, position, and segment embeddings. Column ID and Row ID mark where each token lives in the table grid. A Rank ID encodes the relative numeric order within sortable columns (rank 0 means not comparable, rank i+1 for the i-th smallest value). A Previous Answer indicator flags cells that were selected in the prior conversational turn. Combined with the binary segment embedding that distinguishes question tokens from table tokens, that gives TAPAS its seven-type token representation.
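A minimal sketch of how those per-cell IDs could be assigned when a table is flattened into a sequence. This is my own illustration of the scheme, not the paper's code; the function name and table layout are made up for the example:

```python
def table_embedding_ids(table, numeric_cols):
    """Assign per-cell column, row, and rank IDs in the TAPAS style.

    table: list of rows, each a list of cell strings.
    numeric_cols: indices of columns whose values are comparable.
    Column/row ID 0 is reserved for question tokens, so cells start at 1.
    Rank 0 means "not comparable"; rank i+1 marks the i-th smallest value.
    """
    # Precompute numeric ranks per sortable column
    ranks = {}
    for c in numeric_cols:
        ordered = sorted((float(row[c]), r) for r, row in enumerate(table))
        for i, (_, r) in enumerate(ordered):
            ranks[(r, c)] = i + 1

    ids = []
    for r, row in enumerate(table):
        for c, cell in enumerate(row):
            ids.append({
                "cell": cell,
                "column_id": c + 1,
                "row_id": r + 1,
                "rank_id": ranks.get((r, c), 0),
            })
    return ids

rows = [["Groceries", "120.50"], ["Rent", "900.00"], ["Utilities", "60.25"]]
ids = table_embedding_ids(rows, numeric_cols=[1])
# "120.50" is the 2nd-smallest amount, so its rank ID is 2
```

In the real model each of these IDs indexes a learned embedding table and the vectors are summed with the token embedding; the sketch only shows how the IDs are derived from table position.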

At inference time the model selects a set of cells by thresholding per-cell probabilities, then applies one of four aggregation operators — NONE, COUNT, SUM, or AVERAGE — to produce the final answer. There is no intermediate SQL or logical form. Pre-training runs a standard masked language model objective over 6.2 million English Wikipedia table–text pairs.
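The inference step is simple enough to mimic in a few lines. A sketch, assuming per-cell probabilities and the predicted operator are already in hand (the threshold value here is illustrative, not the paper's tuned hyperparameter):

```python
def tapas_answer(cell_probs, cell_values, op, threshold=0.5):
    """Mimic TAPAS inference: keep cells whose selection probability
    clears the threshold, then apply one of the four operators."""
    selected = [v for p, v in zip(cell_probs, cell_values) if p > threshold]
    if op == "NONE":
        return selected              # the answer is the cells themselves
    if op == "COUNT":
        return len(selected)
    nums = [float(v) for v in selected]
    if op == "SUM":
        return sum(nums)
    if op == "AVERAGE":
        return sum(nums) / len(nums)
    raise ValueError(f"unknown operator: {op}")

# "What is the total?" -> SUM over the high-probability cells
total = tapas_answer([0.9, 0.1, 0.8], ["120.50", "900.00", "60.25"], "SUM")
# -> 180.75
```

Note there is no program in sight: the "query" exists only implicitly in which cells cleared the threshold.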

Key ideas

  • The column/row embeddings are load-bearing. Ablation shows that removing them costs 19.4 accuracy points on SQA, 10.6 on WikiSQL, and 11.6 on WikiTQ — far larger than any other architectural component.
  • Table pre-training matters almost as much. Removing it drops SQA by 12.5 points and WikiTQ by 11.1 points even after fine-tuning.
  • On SQA (conversational table QA), TAPAS raises accuracy from 55.1% to 67.2%, a 12-point jump. The Previous Answer token embedding is the mechanism that makes conversational carry-over work without a separate state tracker.
  • On WikiSQL (single-table, mostly lookup and aggregate), TAPAS reaches 83.6% — essentially matching the 83.9% SOTA semantic parser, with zero SQL generation.
  • Transfer learning from WikiSQL to WikiTQ (multi-step, multi-column reasoning) yields 48.7%, 4.2 points above the state-of-the-art at the time. SQA transfer gives 48.8%.
  • Weak supervision is the key affordability claim: the model is trained on (question, answer) pairs, not (question, SQL, answer) triples, so you can annotate large corpora without SQL expertise.
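How can a scalar answer alone supervise both the cell-selection head and the operator head? By keeping everything soft. A sketch in the paper's spirit (not its exact formulation): compute a differentiable expected result as a probability-weighted mix of the operators applied to soft cell selections, then regress that against the annotated answer.

```python
def expected_result(op_probs, cell_probs, values):
    """Differentiable expected answer for weak supervision.

    op_probs: probabilities over (COUNT, SUM, AVERAGE).
    cell_probs: soft per-cell selection probabilities.
    Every quantity is a smooth function of the model outputs, so a loss
    (e.g. Huber) against the gold scalar can train both heads at once.
    """
    soft_count = sum(cell_probs)
    soft_sum = sum(p * v for p, v in zip(cell_probs, values))
    soft_avg = soft_sum / soft_count if soft_count else 0.0
    return (op_probs[0] * soft_count
            + op_probs[1] * soft_sum
            + op_probs[2] * soft_avg)

est = expected_result(op_probs=[0.1, 0.8, 0.1],
                      cell_probs=[0.9, 0.2, 0.7],
                      values=[10.0, 50.0, 30.0])
```

No SQL or logical form ever appears in the loss; the gradient alone pushes probability mass toward the cells and operator that reproduce the annotated answer.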

What holds up — and what doesn't

The core insight — that many table QA questions can be answered by selecting cells and applying one of four scalar operations — is empirically sound for the benchmarks tested. But the paper's honest error analysis on WikiTQ is telling: 37% of errors are unclassified by the authors themselves, 16% require string manipulation the model can't do, and 10% involve complex temporal reasoning. That means TAPAS's ceiling is not about the aggregation operators being wrong; it is about whole categories of question being structurally out of scope.

The 512-token constraint is a hard wall. Tables with more than roughly 500 cells must be truncated, and the model has no mechanism for multi-table reasoning. This is not a tuning problem — it is an architectural one. The model also cannot nest aggregations: a question like "how many accounts have an average balance greater than zero?" requires two passes (average inside a COUNT predicate), which the four-operator head cannot express.
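To make the nesting limitation concrete, here is that balance question answered the only way it can be: in two passes, with toy data I made up for the example. TAPAS's head would have to emit the inner AVERAGEs and the outer COUNT in a single shot, which its four-operator vocabulary cannot express.

```python
from collections import defaultdict

# Each row: (account, balance). The question "how many accounts have an
# average balance greater than zero?" needs an AVERAGE inside a COUNT.
rows = [("Assets:Checking", 120.0), ("Assets:Checking", -20.0),
        ("Liabilities:Card", -300.0), ("Liabilities:Card", -50.0),
        ("Assets:Savings", 500.0)]

# Pass 1: AVERAGE per account group
balances = defaultdict(list)
for account, balance in rows:
    balances[account].append(balance)
averages = {a: sum(b) / len(b) for a, b in balances.items()}

# Pass 2: COUNT the accounts whose average clears the predicate
answer = sum(1 for avg in averages.values() if avg > 0)
# -> 2 (Checking averages +50, Savings +500; Card averages -175)
```

A SQL generator handles this trivially with a subquery or HAVING clause, which is exactly the compositional depth the cell-selection head gives up.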

TAPEX (ICLR 2022) directly addresses the pre-training bottleneck by replacing Wikipedia table MLM with synthetic SQL execution on auto-generated programs, pushing WikiTQ to 57.5% (+4.8) and SQA to 74.5% (+3.5). That is a meaningful gap. But TAPEX inherits the same architectural limits on table size and compositional depth.

The deeper unresolved question neither paper addresses is whether the cell-selection paradigm is a better fit for real-world table QA than SQL generation on practical grounds — not benchmark accuracy, but auditability and correctness guarantees. Selecting cells is opaque: you get an answer but no program. SQL generation is verbose but verifiable. For production use, that tradeoff matters more than a few accuracy points.

Why this matters for finance AI

A Beancount ledger is effectively a flat, structured table: postings in rows, with account, amount, date, currency, and tags as columns. TAPAS's direct cell-selection paradigm maps naturally onto the most common ledger queries — "what is the total spent on groceries in March?" — which are exactly SUM and COUNT aggregations over filtered rows. The Previous Answer embedding is directly useful for conversational sessions where a user refines a query ("and what about last year?").

But Beancount ledgers at scale break TAPAS's constraints. A multi-year ledger with thousands of transactions exceeds the 512-token budget by orders of magnitude. Account hierarchies require reasoning across row groups. Queries like "which accounts have a net outflow greater than their average over the last three years?" need nested aggregations that the four-operator head cannot express. And critically: for write-back safety, cell selection gives no auditable program to check before committing a change. SQL at least gives an inspectable artifact.
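A back-of-envelope check of that budget claim. The tokens-per-row figure below is my own rough assumption (date, account, amount, currency, and payee fragments each cost wordpieces), not a measured number, but the conclusion is not sensitive to it:

```python
# Rough capacity estimate for TAPAS's 512-token input window.
TOKENS_PER_ROW = 6      # assumed wordpieces per flattened ledger posting
BUDGET = 512            # BERT-style sequence length limit

def max_rows(question_tokens=20, special_tokens=2):
    """How many ledger rows fit alongside the question and [CLS]/[SEP]."""
    return (BUDGET - question_tokens - special_tokens) // TOKENS_PER_ROW

rows_that_fit = max_rows()   # on the order of 80 rows
yearly_postings = 5_000      # a modest personal ledger
```

Even under these generous assumptions the window holds a few dozen postings, while a single year of a modest ledger runs to thousands — the gap the truncation has to swallow.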

My tentative conclusion is that the cell-selection paradigm is the right interface for a natural-language read-only query layer over small ledger snapshots — a month's transactions, a single account's history. For full-ledger reasoning and anything involving write-back, a program-synthesis approach (whether SQL-style or Beancount DSL) remains safer and more expressive.

  • TAPEX: Table Pre-training via Learning a Neural SQL Executor (arXiv:2107.07653, ICLR 2022) — the direct successor that replaces Wikipedia MLM with synthetic SQL execution; directly answers whether pre-training on programs beats pre-training on text for table QA
  • Binder: Binding Language Models in Symbolic Languages (arXiv:2210.02875) — uses GPT-3 to generate programs in SQL or Python over tables and achieves SOTA on WikiTQ; the hybrid approach that cell-selection advocates need to reckon with
  • OmniTab: Natural and Artificially Structured Data for Table QA (arXiv:2207.02270) — combines natural table corpora with synthetic SQL data in a single pre-training recipe; tests whether TAPAS and TAPEX are complementary rather than competing