MultiHiertt: Benchmarking Numerical Reasoning Over Multi-Hierarchical Financial Tables
Every financial QA benchmark I've read this month — FinQA, TAT-QA, ConvFinQA — rests on the same silent assumption: one flat table per document. Real financial reports look nothing like that. Consolidated balance sheets nest subsidiaries inside segments inside parent entities; income statements carry hierarchical line items with sub-totals that themselves feed higher aggregates. MultiHiertt (Zhao et al., ACL 2022) is the first benchmark dataset built to expose exactly this gap, and the numbers that come out of it are sobering.
The paper
Yilun Zhao, Yunxiang Li, Chenying Li, and Rui Zhang at Penn State introduce MultiHiertt, a QA benchmark of 10,440 question-answer pairs drawn from 2,513 real financial reports. Each document averages 3.89 hierarchical tables alongside 68 sentences (~1,645 words) of narrative text. The train/dev/test split is 7,830 / 1,044 / 1,566. The core argument is simple but pointed: prior datasets (FinQA, TAT-QA) evaluate models on documents with a single flat table, which systematically understates the complexity of reasoning over actual financial filings where a question may require synthesizing numbers from three separate sub-tables before applying an arithmetic program.
Along with the dataset, the authors propose MT2Net, a two-stage model: a fact-retrieval module that scores candidate supporting cells and text spans across all tables and paragraphs, followed by a symbolic reasoning module (an arithmetic program executor borrowed from FinQA's NeRd design) that operates over the retrieved facts. MT2Net uses RoBERTa-large as its encoder throughout.
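The two-stage shape is worth pinning down, because it is the pattern most finance QA systems still follow. A minimal sketch of that decomposition — function names (`retrieve_facts`, `answer`, `score_fact`) are my own illustration, not the authors' code:

```python
# Hypothetical sketch of a two-stage retrieve-then-reason pipeline in the
# style of MT2Net. All names here are illustrative, not from the paper.

def retrieve_facts(question, candidates, score_fact, top_k=10):
    """Stage 1: score every candidate table cell / text span against the
    question and keep the top-k as supporting facts."""
    ranked = sorted(candidates, key=lambda c: score_fact(question, c), reverse=True)
    return ranked[:top_k]

def answer(question, facts, generate_program, execute):
    """Stage 2: generate an arithmetic program conditioned on the retrieved
    facts, then execute it symbolically to produce the answer."""
    program = generate_program(question, facts)
    return execute(program, facts)
```

The key property (and key weakness, as the error analysis below shows) is that stage 2 never sees evidence stage 1 dropped: a retrieval miss is unrecoverable.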
Key ideas
- MultiHiertt's 3.89 average tables per document directly mirrors real annual-report structure, where a single question can require values from the income statement, a segment breakdown table, and a footnote schedule — none of which are flat
- MT2Net (RoBERTa-large) achieves 38.43% F1 on the test set; human experts score 87.03% F1 — a gap of nearly 49 points
- Cross-table reasoning questions (requiring evidence from ≥ 2 tables) score 21.04% F1 under the best model, versus 36.77% for single-table questions — a drop of more than 15 points from an already low baseline
- The symbolic reasoning module helps but cannot compensate for retrieval failures: the error analysis shows that 31.5% of errors on hierarchical examples come from selecting the wrong evidence cells before any arithmetic is attempted
- By 2024, GPT-4 with Program-of-Thoughts prompting reaches 67.23% F1 on MultiHiertt, and a dedicated EEDP (evidence-enhanced document prompting) method pushes GPT-4 to 70.32% — still 17 points below human ceiling
- Annotation quality is solid: inter-annotator Kappa of 0.72–0.90, with 76.8%–94.0% of samples rated ≥ 4/5 for correctness by crowd workers
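To make the cross-table case concrete, here is a toy executor for a FinQA-style arithmetic program, where operands are either literal numbers pulled from tables or `#i` references to earlier step results. The program format follows the FinQA convention; the revenue figures are invented for illustration:

```python
def execute_program(steps):
    """Execute a FinQA-style arithmetic program: a list of (op, a, b) steps
    whose operands are numbers or '#i' references to earlier results."""
    ops = {"add": lambda a, b: a + b,
           "subtract": lambda a, b: a - b,
           "multiply": lambda a, b: a * b,
           "divide": lambda a, b: a / b}
    results = []
    for op, a, b in steps:
        # '#0' means "the result of step 0"; anything else is a literal.
        resolve = lambda x: results[int(x[1:])] if isinstance(x, str) else x
        results.append(ops[op](resolve(a), resolve(b)))
    return results[-1]

# Toy cross-table question: "segment revenue as a share of consolidated
# revenue" — the two operands live in two different tables of the document.
segment_revenue = 1200.0   # from a segment-breakdown table (invented figure)
total_revenue = 4800.0     # from the consolidated income statement (invented)
share = execute_program([("divide", segment_revenue, total_revenue)])
# share == 0.25
```

The arithmetic itself is trivial; the 21% F1 on cross-table questions says the hard part is landing on the right two operands in the first place.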
What holds up — and what doesn't
The dataset construction is careful and the annotation quality metrics are reassuring. The core claim — that single-table benchmarks understate real complexity — is obviously true and the 15-point F1 gap between single- and multi-table subsets makes it concrete. The comparison table (Table 1 in the paper) clearly shows that FinQA and TAT-QA have one table per document; MultiHiertt is genuinely filling a real gap.
That said, MT2Net is not a strong proposed solution — it's closer to a strong baseline. The retrieval module is a span-level scorer trained on gold supporting-fact annotations, so its quality hinges on having that supervision signal at training time. The paper does not evaluate what happens when the hierarchical structure is implicit (no explicit parent-child HTML nesting), which is common in scanned filings and older PDFs. And the test set is withheld behind a CodaLab leaderboard, which makes it hard to independently replicate results or probe failure modes.
I also want to flag something the authors underemphasize: the 2024 GPT-4 results show that raw reasoning power can close much of the gap without any architecture specifically designed for hierarchy. GPT-4 gets to 70% without ever being told the document has hierarchical tables — it just reads the rendered HTML. That is actually an interesting finding: hierarchy awareness may matter less than sheer context capacity and arithmetic reliability. The binding constraint may still be retrieval precision over long documents, not reasoning architecture.
Why this matters for finance AI
Beancount agents face this exact problem. A question like "what was our effective tax rate in 2023?" requires finding the pre-tax income line from the income statement, the income tax expense from a separate note, and possibly a segment-level breakdown to reconcile the consolidated figure. None of those live in a single flat table. The 15-point F1 penalty for cross-table reasoning in MultiHiertt quantifies what I would expect to see in a Beancount context: agents that look good on single-account queries will degrade significantly when a question requires joining across ledger sections.
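The effective-tax-rate question can be made concrete with a pure-Python stand-in for the three places the numbers live in a filing. All figures and line-item names below are invented; the point is only the shape of the cross-table join:

```python
# Illustrative stand-ins for three sources in a real filing. All figures
# and line-item names are invented for this sketch.
income_statement = {"Pre-tax income": 5000.0}
tax_note = {"Current tax expense": 900.0, "Deferred tax expense": 150.0}
segment_breakdown = {"Segment A pre-tax": 3200.0, "Segment B pre-tax": 1800.0}

# Cross-table join: tax expense comes from the note, pre-tax income from the
# income statement, and the segment table is used only to reconcile.
total_tax = tax_note["Current tax expense"] + tax_note["Deferred tax expense"]
pretax = income_statement["Pre-tax income"]
assert abs(sum(segment_breakdown.values()) - pretax) < 1e-9  # reconciliation
effective_tax_rate = total_tax / pretax
# effective_tax_rate == 0.21
```

The reconciliation assert is the step single-table benchmarks never exercise: confirming that a figure assembled from one table agrees with the consolidated figure in another.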
The error analysis is directly actionable. If 31.5% of errors are wrong-evidence retrievals before any calculation happens, then the priority for a Beancount write-back agent is not a better arithmetic engine — it is a better evidence selector. An agent that retrieves the wrong ledger lines before doing the math will produce plausible-looking but wrong entries, exactly the failure mode that is hardest to catch in an audit.
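One cheap mitigation that follows from this: rank candidate evidence lines explicitly and refuse to write back when nothing scores above zero, so wrong-evidence failures surface as abstentions rather than plausible-looking entries. A toy lexical scorer — a real agent would use a trained retriever, and `rank_evidence` and the sample lines are my invention:

```python
def rank_evidence(question, lines, top_k=3):
    """Rank candidate ledger/report lines by token overlap with the
    question. Toy lexical scoring; drops lines with zero overlap so the
    caller can abstain instead of guessing."""
    q_tokens = set(question.lower().split())
    scored = [(len(q_tokens & set(line.lower().split())), line) for line in lines]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [line for score, line in scored[:top_k] if score > 0]

lines = [
    "Income tax expense 1,050",
    "Pre-tax income 5,000",
    "Depreciation and amortization 700",
]
evidence = rank_evidence("what was the income tax expense", lines, top_k=1)
# evidence == ["Income tax expense 1,050"]
```

Even this crude scorer makes the retrieval step inspectable: an auditor can see which lines the agent believed before any arithmetic happened, which is exactly where MultiHiertt says most errors originate.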
The GPT-4 trajectory is also encouraging for the near term: moving from 38% to 70% over two years suggests that multi-table financial reasoning is tractable as context windows and reasoning improve, even without domain-specific training. But the remaining 17-point gap to human performance is not noise — it likely reflects cases where hierarchical structure carries semantic load that flat text rendering loses.
What to read next
- Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (Lewis et al., NeurIPS 2020) — arXiv:2005.11401 — the foundation that almost every finance QA system builds on; understanding its parametric vs. non-parametric memory split matters for deciding how to structure ledger retrieval
- FLARE: Active Retrieval Augmented Generation (Jiang et al., EMNLP 2023) — arXiv:2305.06983 — retrieves mid-generation when the model predicts it needs new facts, which is a natural fit for multi-table reasoning where you discover mid-reasoning that you need a subsidiary table
- TAT-LLM: A Specialized Language Model for Discrete Reasoning over Financial Tabular and Textual Data (Zhao et al., ICAIF 2024) — fine-tunes an LLM specifically on FinQA/TAT-QA/MultiHiertt and shows what domain adaptation actually buys over GPT-4 prompting
