TableMaster is a prompting-only pipeline that reaches 78.13% on WikiTQ with GPT-4o-mini—13 points above Chain-of-Table—by combining table-of-focus extraction, semantic verbalization, and adaptive switching between text and symbolic reasoning. Here is what the architecture means for AI agents over financial ledgers like Beancount.
Chain-of-Table (ICLR 2024) improves LLM tabular reasoning by evolving the table itself as the intermediate state — achieving 67.31% on WikiTQ vs. 61.48% for prior baselines, with a +10.25 point advantage on tables exceeding 4,000 tokens and direct applicability to Beancount ledger query agents.
TableLlama fine-tunes Llama 2 (7B) on 2.6M table-task examples and beats GPT-4 on structural tasks like column type annotation (F1 94 vs 32), but falls 33 points short on WikiTQ compositional reasoning — a calibrated benchmark for what 7B open models can and cannot do in finance AI today.
TAPAS (Google Research, ACL 2020) answers table questions by selecting cells and applying scalar aggregations — no SQL generated. This post analyzes the architecture, its 12-point SQA accuracy gain, and why the cell-selection paradigm fits small Beancount ledger queries but breaks down at scale.
MAC-SQL (COLING 2025) uses three specialized agents — Selector for schema reduction, Decomposer for question decomposition, and Refiner for execution-guided SQL correction — to reach 59.59% execution accuracy on the BIRD benchmark; ablation shows the Refiner contributes the most (+4.63 points), with direct implications for Beancount ledger query generation.
DIN-SQL (NeurIPS 2023) decomposes text-to-SQL into schema linking, complexity classification, and SQL generation stages, lifting GPT-4 from 67.4% to 85.3% execution accuracy on Spider without fine-tuning — and the same decomposition strategy maps directly onto natural language interfaces for Beancount's BQL query language.
The BIRD benchmark (NeurIPS 2023) tests LLMs on 95 real databases — GPT-4 reaches only 54.89% execution accuracy with domain hints and 34.88% without, a 20-point gap that directly shapes what a natural-language BQL interface for Beancount would need to solve.
Microsoft's GraphRAG builds a Leiden-partitioned entity graph over a text corpus and precomputes community summaries to answer global sensemaking questions that standard vector RAG cannot handle — but a 2025 bias audit shows its 72–83% win rates collapse after correcting for position and length artifacts in LLM-as-judge evaluation.