TAT-QA: Hybrid Table-Text QA Benchmark for Financial Annual Report Reasoning
Reading TAT-QA today because it sits at an intersection that matters directly to what we're building: questions that can only be answered by reasoning across a table and the surrounding text simultaneously. In Beancount, every ledger entry exists in context — a table row that makes no sense without the memo, the counterparty narrative, or the account policy that explains why that line item is there. TAT-QA, published at ACL 2021 by Zhu et al. from the NExT++ lab at NUS, is the benchmark that forced the NLP community to confront this problem head-on.
The paper
Fengbin Zhu, Wenqiang Lei, Youcheng Huang, Chao Wang, Shuo Zhang, Jiancheng Lv, Fuli Feng, and Tat-Seng Chua introduce TAT-QA (Tabular And Textual QA), a dataset of 16,552 questions over 2,757 hybrid contexts drawn from real financial annual reports. Each context pairs a semi-structured table with at least two accompanying paragraphs: exactly the structure you find in 10-K filings, where a revenue table sits next to management's discussion of what drove the numbers. A large share of the questions require arithmetic: addition, subtraction, multiplication, division, counting, comparison, sorting, and multi-operation compositions.
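The hybrid-context shape is easy to picture as a small data model. A minimal sketch, with illustrative field names rather than the released dataset's actual JSON schema:

```python
from dataclasses import dataclass


@dataclass
class HybridContext:
    """One TAT-QA-style context: a semi-structured table plus the
    paragraphs that surround it. Field names here are illustrative,
    not the dataset's real schema."""
    table: list[list[str]]   # rows of cell strings, headers included
    paragraphs: list[str]    # at least two accompanying text paragraphs


@dataclass
class Question:
    text: str
    answer: str
    requires_arithmetic: bool  # addition, subtraction, counting, etc.


# A toy context shaped like a 10-K revenue table plus discussion.
ctx = HybridContext(
    table=[["Segment", "2019", "2018"],
           ["Cloud", "1,200", "900"],
           ["Hardware", "800", "850"]],
    paragraphs=["Cloud revenue grew primarily on new enterprise contracts.",
                "Hardware revenue declined on lower unit volumes."],
)
q = Question(text="What was the change in Cloud revenue from 2018 to 2019?",
             answer="300", requires_arithmetic=True)
```

The point of the pairing is that neither half is sufficient alone: the table holds the numbers, the paragraphs hold the reasons, and many questions need both.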
The core contribution is twofold: the benchmark itself, and TAGOP, a new model that treats the task as evidence tagging followed by symbolic reasoning. TAGOP uses a sequence tagger over the concatenated table cells and text spans to identify which pieces of evidence to collect, then applies a fixed set of aggregation operators (sum, difference, product, ratio, count, etc.) to compute the final answer. No neural arithmetic — the calculation itself is always delegated to a symbolic executor.
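The two stages can be sketched in a few lines, with the learned sequence tagger stubbed out by gold evidence indices and only a subset of the aggregation operators (this is a simplified illustration of the tag-then-execute idea, not TAGOP's actual implementation):

```python
from typing import Callable

# Fixed symbolic operators applied to the tagged numeric evidence.
# A simplified subset of TAGOP's operator set.
OPERATORS: dict[str, Callable[[list[float]], float]] = {
    "sum": lambda xs: sum(xs),
    "difference": lambda xs: xs[0] - xs[1],
    "ratio": lambda xs: xs[0] / xs[1],
    "count": lambda xs: float(len(xs)),
    "average": lambda xs: sum(xs) / len(xs),
}


def tag_evidence(cells: list[str], relevant: set[int]) -> list[float]:
    """Stand-in for the neural tagger: TAGOP learns which cells/spans
    to pick; here we pass gold indices in directly."""
    return [float(cells[i].replace(",", "")) for i in sorted(relevant)]


def answer(cells: list[str], relevant: set[int], op: str) -> float:
    """Evidence tagging followed by symbolic execution: the arithmetic
    itself is never done by the neural network."""
    return OPERATORS[op](tag_evidence(cells, relevant))


# "What was the change in Cloud revenue from 2018 to 2019?"
cells = ["Cloud", "1,200", "900"]
print(answer(cells, {1, 2}, "difference"))  # 1200 - 900 = 300.0
```

Because the executor is deterministic, every error in this design traces back to either picking the wrong evidence or picking the wrong operator, which is exactly what the paper's error analysis shows.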
Key ideas
- Evidence identification is the hard part, not the arithmetic. TAGOP's error analysis attributes roughly 55% of failures to incorrect tagging and 29% to missing evidence. Once you have the right cells and spans, the symbolic executor rarely makes a computation error. This is a direct signal: for finance agents, the retrieval and grounding step dominates.
- Text-only models fail immediately. BERT-RC manages only 18.7% F1 on the test set. NumNet+ V2, the best pre-TAT-QA numerical reader, reaches 46.9% F1. The table-only TaPas baseline scores 22.8% F1. A model that reads tables without text — or text without tables — is disqualified from this domain.
- TAGOP scores 58.0% F1 (50.1% exact match), human experts score 90.8% F1 (84.1% EM). The 32.8-point F1 gap at publication time was alarming. It meant that even the best 2021 system answers fewer than two-thirds of questions a trained analyst can handle.
- By late 2024, the leaderboard tells a different story. The top system, TAT-LLM (70B), reaches 88.4% F1 — only 2.4 points below human. TAT-LLM (7B) reaches 82.88% F1, and GPT-4 in zero-shot reaches 79.71% F1. The gap closed dramatically, mostly through LLM-scale fine-tuning.
- Specialized fine-tuning still beats raw GPT-4. TAT-LLM 7B (74.56% EM) outperforms GPT-4 zero-shot (71.92% EM) on TAT-QA, even at a fraction of the parameter count. The step-wise Extractor→Reasoner→Executor pipeline that TAT-LLM uses mirrors TAGOP's intuition but replaces the symbolic tagger with a prompted LLM.
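For reference on the F1 and EM numbers quoted above: exact match is all-or-nothing, while span F1 gives partial credit via bag-of-tokens overlap. The official TAT-QA scorer additionally normalizes numbers and scales, but the core of the span metric is roughly:

```python
from collections import Counter


def token_f1(prediction: str, gold: str) -> float:
    """Bag-of-tokens F1 over a predicted and gold answer string.
    The official TAT-QA metric adds numeric/scale normalization on top."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(token_f1("net revenue increased", "revenue increased"))  # partial credit, 0.8
```

This is why F1 always sits above EM on the leaderboard: near-miss extractions still earn overlap credit that exact match throws away.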
What holds up — and what doesn't
The benchmark is real data, real questions, real financial reports, and that credibility is its biggest asset. The 32.8-point human-model F1 gap at publication was genuine, and the dataset is hard enough that even three years later the top systems haven't fully closed it.
What concerns me is the single-table assumption. Each TAT-QA context contains exactly one table. Real annual reports contain dozens, often with hierarchical relationships across segments, subsidiaries, and time periods. A model that can answer TAT-QA questions perfectly is still unprepared for the cross-table consolidation that dominates real accounting work. The MMQA paper (ICLR 2025) makes exactly this point — that single-table benchmarks like TAT-QA understate the multi-table complexity practitioners face.
The answer-type distribution also makes the benchmark easier in practice than it looks. About 42% of TAT-QA answers are single spans: direct extractions requiring no calculation. The challenging multi-operation compositions are a minority. A model that gets all extractions right and all arithmetic wrong would still score somewhere in the 30–40% range. The benchmark doesn't weight by difficulty, which flattens the signal from the truly hard reasoning cases.
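Back-of-envelope, the flattening effect looks like this. The 42% span share is from the discussion above; the split of the remainder is purely illustrative:

```python
# Hypothetical answer-type shares: ~42% single spans (per the paper's
# distribution); the split of the rest is illustrative, not from the paper.
shares = {"span": 0.42, "arithmetic": 0.46, "multi_span_or_count": 0.12}

# An "extraction-only" model: perfect on spans, zero on everything else.
accuracy = {"span": 1.0, "arithmetic": 0.0, "multi_span_or_count": 0.0}

score = sum(shares[t] * accuracy[t] for t in shares)
print(f"{score:.0%}")  # ~42% ceiling; imperfect extraction in practice
                       # drags this into the 30-40% band
```

An unweighted average over answer types means a model with zero reasoning ability still posts a superficially respectable overall score.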
Finally, the human baseline (90.8% F1) was computed using annotators who had access to the document but may not have been CPA-level experts. For Beancount-scale ledger reasoning — where an agent must understand accounting policy, not just arithmetic — 90.8% may be an overestimate of the "correct" ceiling.
Why this matters for finance AI
TAT-QA is the closest public benchmark to what a Beancount agent faces on a daily basis: structured entry data (table) sitting next to unstructured narrative (memo, description, policy note). The TAGOP result confirms what I'd expect from building ledger tooling — grounding is harder than computing. Getting the right cells tagged is the problem; summing them is trivial.
The leaderboard trajectory is encouraging for product: a 7B parameter model fine-tuned on this domain outperforms GPT-4 zero-shot, which suggests that a Beancount-specific fine-tuned model could handle the retrieval+arithmetic workload without needing frontier model API calls for every ledger query. Latency, cost, and data privacy all improve if we can run a compact specialist locally.
The single-table limitation is the direct gap to close for Bean Labs. Beancount ledgers are effectively multi-table documents — account postings, budget lines, reconciliation notes — and the benchmark that captures that multi-hop structure across related tables doesn't fully exist yet. MultiHiertt (ACL 2022) is the closest thing; it's the next paper on my list.
What to read next
- MultiHiertt: Numerical Reasoning over Multi Hierarchical Tabular and Textual Data (arXiv:2206.01347, ACL 2022) — directly addresses TAT-QA's single-table limitation; questions require reasoning across multiple hierarchical tables within the same financial document, closer to what consolidated ledger statements look like
- ConvFinQA: Exploring the Chain of Numerical Reasoning in Conversational Finance Question Answering (arXiv:2210.03849, EMNLP 2022) — extends FinQA to multi-turn dialogue; models must track running numerical context across question turns, which maps to how a Beancount agent handles follow-up queries about a ledger session
- TAT-LLM: A Specialized Language Model for Discrete Reasoning over Tabular and Textual Data (arXiv:2401.13223, ICAIF 2024) — the direct follow-up from the same NExT++ group; shows how LLaMA-2 fine-tuned with an Extractor→Reasoner→Executor pipeline beats GPT-4 zero-shot on TAT-QA and FinQA
