
Can LLMs Reason Over Tabular Data? What Four Benchmarks Tell Us About Finance AI

6 min read
Mike Thrift
Marketing Manager

Tables are how accountants think. A Beancount ledger is essentially a table — accounts as rows, dates and amounts as columns, assertions as constraints across cells. So when I started asking whether LLMs can power autonomous finance agents, I kept running into the same prior question: can they even reliably read a table? The literature on this is more damning than I expected.

The paper


Fang et al. published "Large Language Models(LLMs) on Tabular Data: Prediction, Generation, and Understanding — A Survey" in TMLR 2024 (arXiv:2402.17944). It's a 41-page taxonomy covering three domains: predicting structured outcomes from tabular features, generating synthetic tabular data, and understanding tables well enough to answer questions about them. The understanding track — table question answering (TableQA), fact verification, and structural reasoning — is where the most relevant work for finance AI lives.

The paper I read alongside it, "Table Meets LLM: Can Large Language Models Understand Structured Table Data?" by Sui et al. (WSDM 2024, arXiv:2305.13062), takes a more controlled approach: they define a Structural Understanding Capability (SUC) benchmark with seven narrow tasks — table partition, size detection, merged cell detection, cell lookup, reverse lookup, column retrieval, and row retrieval — and test GPT-3.5 and GPT-4 directly. No reasoning chains, no retrieval tricks. Just: can the model do what we ask?
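Part of what makes SUC clean is that every task has a trivial deterministic ground truth. A sketch of my own (not code from the paper) for two of the seven tasks:

```python
from typing import List, Optional, Tuple

def cell_lookup(table: List[List[str]], row: int, col: int) -> str:
    """SUC 'cell lookup': return the value at (row, col)."""
    return table[row][col]

def reverse_lookup(table: List[List[str]], value: str) -> Optional[Tuple[int, int]]:
    """SUC 'reverse lookup': return the (row, col) where a value appears, if any."""
    for r, row_cells in enumerate(table):
        for c, cell in enumerate(row_cells):
            if cell == value:
                return (r, c)
    return None
```

A few lines of code versus a frontier model at 44–73% accuracy, which is exactly the contrast the benchmark is designed to expose.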

Key ideas

  • The format gap is real and surprisingly large. On the SUC benchmark, HTML serialization outperforms the natural-language-with-separators format by 6.76% overall. The ranking — HTML > XML > JSON > Markdown > NL+Sep — holds consistently across tasks. Beancount files sit near the natural-language end of this spectrum, which is a warning sign.
  • Cell lookup is surprisingly hard. GPT-3.5 achieves only 44% accuracy on direct cell lookup (find the value at row X, column Y). GPT-4 reaches 73.34% on the same task. For a deterministic operation a spreadsheet formula handles in microseconds, a gap of nearly 30 percentage points between models is alarming.
  • Few-shot examples are load-bearing. Removing 1-shot examples from the SUC prompts caused a 30.38% overall accuracy drop across all tasks. The model's structural understanding is heavily scaffolded by demonstration, not genuinely internalized.
  • The human-LLM gap on real table QA is enormous. TableBench (arXiv:2408.09174, AAAI 2025) evaluates 886 questions across fact checking, numerical reasoning, data analysis, and visualization. Human accuracy is 85.91%. GPT-4-Turbo scores 40.38%, GPT-4o scores 42.73%. The best current models are performing at roughly half the human level on a benchmark designed to reflect real-world table complexity.
  • Complexity collapse on financial spreadsheets is severe. FinSheet-Bench (arXiv:2603.07316) tests LLMs on private equity fund templates with varying structural complexity. Simple lookups achieve 89.1% accuracy. Complex aggregations drop to 19.6%. The largest test file (152 companies, 8 funds) yields 48.6% average accuracy across all models, down from 86.2% on the simplest file.
  • Long tables break models categorically. The TMLR survey reports that beyond 1000 tokens, GPT-3's performance degrades to near-random. Even 200K context-window models struggle with massive datasets due to the quadratic cost of self-attention over long sequences.
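The format ranking is easy to probe on your own data. A minimal sketch (function names are mine) that renders the same toy table in the best- and worst-performing SUC formats:

```python
from html import escape

HEADER = ["date", "account", "amount"]
ROWS = [
    ["2024-01-05", "Expenses:Groceries", "42.10 USD"],
    ["2024-01-06", "Assets:Checking", "-42.10 USD"],
]

def to_html(header, rows):
    """HTML serialization: explicit <th>/<td> tags give the model
    structural anchors it does not have to infer."""
    head = "".join(f"<th>{escape(h)}</th>" for h in header)
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(c)}</td>" for c in row) + "</tr>"
        for row in rows
    )
    return f"<table><tr>{head}</tr>{body}</table>"

def to_nl_sep(header, rows, sep=" | "):
    """NL+Sep serialization: compact, but every structural boundary
    must be inferred from the separator character."""
    return "\n".join([sep.join(header)] + [sep.join(row) for row in rows])
```

Same information, roughly 6.76% apart in downstream accuracy — a reminder that the serializer is part of the system being evaluated.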

What holds up — and what doesn't

The Sui et al. benchmark is carefully designed and the numbers are believable. The finding that HTML outperforms markdown for structural tasks is counterintuitive — markdown is more compact and LLMs see plenty of it in training — but it makes sense in hindsight: HTML's explicit tagging gives the model more anchors for navigating structure without having to infer it.

What I'm skeptical of: the self-augmentation technique (two-stage prompting where the first prompt asks the model to identify critical values before answering) produces improvements of 0.84–5.68% on downstream benchmarks like TabFact and ToTTo. These are real numbers from real experiments, but they're marginal. The technique doesn't address the fundamental problem — it's a prompt-engineering patch on top of genuinely weak structural understanding.
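The two-stage pattern itself is trivial to reproduce. A minimal sketch, where `ask` is a hypothetical stand-in for whatever chat-completion client you use, and the prompt wording is mine, not the paper's:

```python
from typing import Callable

def self_augmented_qa(table: str, question: str,
                      ask: Callable[[str], str]) -> str:
    """Two-stage prompting in the spirit of Sui et al.'s self-augmentation:
    first ask the model to surface critical values, then answer with those
    values prepended as extra context."""
    stage1 = ask(
        f"Table:\n{table}\n\n"
        "Identify the cells and ranges most relevant to answering "
        f"questions like: {question}\nList them briefly."
    )
    stage2 = ask(
        f"Table:\n{table}\n\n"
        f"Critical values (from a first pass):\n{stage1}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return stage2
```

Two model calls for a 0.84–5.68% gain — whether that trade is worth it depends entirely on your latency and cost budget.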

The TMLR survey has the scope problem common to all surveys: it covers everything from tabular prediction (XGBoost country) to generative table synthesis to QA, which dilutes the analysis. The most actionable section for my purposes is the structured QA track, and even there the survey mostly catalogues methods rather than synthesizing which ones are actually reliable.

The FinSheet-Bench finding that complex aggregations score 19.6% is the most finance-specific alarm bell here. Portfolio aggregation, fund-level rollups, and multi-period comparisons are exactly the operations that make financial reporting non-trivial — and they're exactly where LLMs fall apart.

Why this matters for finance AI

Beancount ledgers are tables. When an autonomous agent reads a ledger to detect anomalies, generate reports, or decide on a write-back, it is performing tabular reasoning. The evidence suggests that current LLMs handle simple lookups reasonably well (cell retrieval at 73% for GPT-4) but collapse on the operations that matter most: multi-step aggregation, size estimation for large ledgers, and reasoning over structural variations.

The serialization finding has immediate practical implications. If I'm piping Beancount files into an LLM, the format I choose affects accuracy by several percentage points before I've written a single line of agent logic. Beancount's native syntax is close to the "NL+Sep" end of the format hierarchy — readable for humans, suboptimal for LLMs. Converting to a more structured intermediate (a JSON or HTML table of transactions) before feeding to a model may be worth the preprocessing cost.
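As a sketch of that preprocessing step, assuming postings have already been extracted from the ledger (in practice via `beancount.loader`) into plain rows — the sample data and function names here are mine:

```python
import json
from html import escape

# One posting per row; in practice these come out of the beancount loader.
postings = [
    ("2024-03-01", "Acme Corp", "Income:Salary", "-5000.00", "USD"),
    ("2024-03-01", "Acme Corp", "Assets:Checking", "5000.00", "USD"),
    ("2024-03-02", "Grocer", "Expenses:Food", "87.40", "USD"),
]

COLUMNS = ("date", "payee", "account", "amount", "currency")

def postings_to_json(rows):
    """JSON intermediate: one object per posting, keys naming the columns."""
    return json.dumps([dict(zip(COLUMNS, r)) for r in rows], indent=1)

def postings_to_html(rows):
    """HTML intermediate: the serialization that scored highest on SUC."""
    head = "".join(f"<th>{c}</th>" for c in COLUMNS)
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(v)}</td>" for v in r) + "</tr>"
        for r in rows
    )
    return f"<table><tr>{head}</tr>{body}</table>"
```

The conversion is cheap and lossless for the fields an agent actually reasons over, so there is little reason to feed raw ledger syntax to the model.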

The complexity collapse at scale is the most sobering finding. A real Beancount ledger for a small business might have thousands of transactions, dozens of accounts, and multi-year history. The FinSheet-Bench results suggest that once a table grows to the size where it actually matters, LLM accuracy degrades into territory that is not safe for autonomous write-back.
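One mitigation follows directly from these numbers: treat any model-produced aggregate as a claim to be verified, and recompute it deterministically before allowing a write-back. A minimal sketch (the gating logic and names are mine, not from any of the papers):

```python
from collections import defaultdict
from decimal import Decimal

def rollup(postings):
    """Deterministic per-account totals -- the kind of complex aggregation
    that scored 19.6% on FinSheet-Bench when delegated to a model."""
    totals = defaultdict(Decimal)
    for account, amount in postings:
        totals[account] += Decimal(amount)
    return dict(totals)

def gate_writeback(llm_totals, postings, tolerance=Decimal("0.01")):
    """Accept the model's aggregate only if it matches the recomputation
    on every account, within a rounding tolerance."""
    truth = rollup(postings)
    for account, claimed in llm_totals.items():
        if abs(truth.get(account, Decimal(0)) - Decimal(claimed)) > tolerance:
            return False
    return set(llm_totals) == set(truth)
```

The model can still propose; the deterministic layer decides. That division of labor is the only configuration the benchmark numbers currently support.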

Related work worth tracking

  • TableLLM (arXiv:2311.09206) — a model fine-tuned on 169 Kaggle tables (UniPredict); reported to substantially outperform zero-shot GPT-4 on tabular prediction, which suggests domain-specific fine-tuning is still the right approach for finance-specific table tasks.
  • TAT-QA (arXiv:2105.07624) — a dataset specifically for discrete reasoning over hybrid financial documents (tables + text, like earnings reports); the accompanying TAT-LLM model is the most direct precedent for applying specialized models to financial table reasoning.
  • ToRR: A Benchmark for Table Reasoning and Robustness (arXiv:2502.19412) — focuses on adversarial perturbations like row shuffling and column reordering; if a Beancount agent is robust to reordering, it's a signal that it understands structure rather than position.
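That ToRR-style perturbation is cheap to replicate as a self-test. A sketch of my own, where `answer` is a hypothetical callable wrapping the agent: ask the same question over the original and row-shuffled serializations, and flag the agent if an order-insensitive answer diverges:

```python
import random
from typing import Callable, List

def shuffle_invariance_probe(
    header: str,
    rows: List[str],
    question: str,
    answer: Callable[[str, str], str],
    trials: int = 3,
    seed: int = 0,
) -> bool:
    """ToRR-style robustness check: an order-insensitive question should
    receive the same answer regardless of row order. Returns False on the
    first divergence."""
    rng = random.Random(seed)
    baseline = answer("\n".join([header] + rows), question)
    for _ in range(trials):
        shuffled = rows[:]
        rng.shuffle(shuffled)
        if answer("\n".join([header] + shuffled), question) != baseline:
            return False
    return True
```

A probe like this costs a handful of extra model calls per question, and a failure is a concrete signal that the agent is keying on position rather than structure.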