
FinMaster Benchmark: Why LLMs Score 96% on Financial Literacy but 3% on Statement Generation

· 5 min read
Mike Thrift
Marketing Manager

The FinMaster paper landed in my reading queue right after ReAct. If ReAct is about how agents decide when to act, FinMaster asks a harder question: how well do today's best LLMs perform on the actual accounting workflows those agents need to execute? Submitted in May 2025, it's the first benchmark I've seen that covers the full pipeline—financial literacy, accounting, auditing, and consulting—in one coherent evaluation framework.

The paper


Jiang et al. introduce FinMaster (arXiv:2505.13533), a three-part benchmark for evaluating LLMs on financial workflows. The first component, FinSim, is a synthetic data generator that simulates five types of companies and produces ledger transactions—both correct and deliberately erroneous—to populate test scenarios without real-world data privacy concerns. The second, FinSuite, bundles 183 tasks spanning financial literacy, accounting, auditing, and consulting at varying difficulty levels. The third, FinEval, provides a unified scoring interface. Together, the authors claim FinMaster is the first benchmark to cover the full financial pipeline with infinite, privacy-safe data generation—a claim that holds up when compared to static predecessors like FinBen and FinanceBench.

Key ideas

  • The cliff at complexity: Models score ~96% average on financial literacy (reading balance sheets, income statements), then fall to 40–60% on basic accounting calculations, below 20% on multi-step accounting tasks, and just 3% on financial statement generation. Literacy and computation are not the same skill.
  • Error propagation is severe: In consulting tasks, single-metric calculations averaged 58% accuracy; multi-metric scenarios that chain those calculations dropped to 37%—a 21-point fall from compounding small errors.
  • The leaderboard is tight at the top: o3-mini (0.73 average), Claude-3.7-Sonnet (0.72), and DeepSeek-V3-2503 (0.70) are clustered closely, suggesting the benchmark is non-trivial but not yet a ceiling.
  • Accounting is the hard domain: Across all seven evaluated models, accounting scores ranged from only 0.04 to 0.35—far below any other category. Statement generation at 3% means LLMs cannot yet reliably synthesize a transaction journal into a coherent financial statement.
  • Reasoning models help at the margins: o3-mini leads overall, but not decisively. The chain-of-thought advantage is real, but nowhere near enough to bridge the 93-point gap between literacy and statement generation.
  • FinSim enables stress-testing at scale: Prior benchmarks use static, fixed datasets vulnerable to contamination over time. FinMaster can generate new scenarios on demand, which matters for studying whether models generalize or merely memorize.
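The error-propagation numbers can be sanity-checked with simple arithmetic. If each step in a chain had to succeed independently, a two-step chain of 58%-accurate calculations would land near 58%², which is close to the observed 37%. A minimal sketch (the independence assumption is mine, not the paper's):

```python
# Back-of-envelope check on error compounding in chained calculations.
# Assumption (mine, not the paper's): each step succeeds independently,
# and the chain is correct only if every step is correct.

def chained_accuracy(per_step: float, steps: int) -> float:
    """Accuracy of a chain in which every step must be correct."""
    return per_step ** steps

single_metric = 0.58    # reported single-metric accuracy
observed_multi = 0.37   # reported multi-metric accuracy

predicted = chained_accuracy(single_metric, 2)
print(f"predicted two-step accuracy: {predicted:.2f}")
print(f"observed multi-metric:       {observed_multi:.2f}")
```

Under this toy model the prediction comes out around 0.34, in the same neighborhood as the reported 0.37, which is why the compounding explanation is plausible without being proven.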

What holds up — and what doesn't

The core result—that multi-step financial reasoning degrades sharply—is credible and matches patterns from LOG-001 (FinBen) and LOG-002 (Toolformer). I believe the error propagation finding; it's structurally similar to what happens in any arithmetic chain. The FinSim generator is a genuine methodological contribution: a benchmark that can generate fresh scenarios resists the memorization problem that plagues static financial datasets.

What I'm less convinced by: 183 tasks is thin for a benchmark claiming holistic coverage. Thirty-five auditing tasks cannot characterize a domain as broad as financial auditing, where real-world error taxonomies have hundreds of entries. The paper collapses the whole domain into 12 basic error types, which obscures the heterogeneity of actual audit findings.

The single aggregate leaderboard score also conceals important cross-domain patterns. Auditing and consulting have very different model-by-model profiles, and averaging them produces a number that is easy to quote but hard to act on.

The synthetic data limitation is a double-edged sword. FinSim generates clean, well-structured ledger data. Real accounting systems carry decades of legacy encoding choices, currency rounding artifacts, and off-cycle adjustments that no simulator captures. A 3% score on synthetic statement generation is grim; the same measurement on a real company's messy books would likely be grimmer still. The paper is also text-only—the authors acknowledge the multimodal gap but don't measure it. Most accounting work actually lives in scanned PDFs and spreadsheets.

Why this matters for finance AI

This is the most directly relevant paper I've read since FinBen for the Bean Labs agenda. The Beancount use case is essentially a subset of what FinMaster evaluates: transaction-level accounting, multi-step calculations, and report generation. The 3% on statement generation is a sobering number. It tells me that even with a well-designed ReAct agent scaffold, the underlying model's ability to synthesize a correct Beancount balance sheet from a transaction journal is unreliable without specialized fine-tuning or retrieval scaffolding.

The error propagation result is directly relevant to write-back safety. If a consulting task chain loses 21 points of accuracy from step one to step two, then an autonomous Beancount agent performing a three-step reconciliation is compounding errors at each stage. This is a strong argument for breaking agent tasks into the smallest possible atomic operations and verifying intermediate results rather than relying on end-to-end LLM reasoning.
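The "smallest possible atomic operations with verification" idea can be made concrete with a toy guard that refuses any unbalanced posting before it reaches the ledger. Everything here is a hypothetical illustration of the pattern, not Beancount's actual API:

```python
# Sketch of "atomic step + verify before write-back" for a ledger agent.
# All names and structures are hypothetical, not Beancount's API.
from decimal import Decimal

def net_balance(postings):
    """Sum of posting amounts; a well-formed transaction nets to zero."""
    return sum((amt for _account, amt in postings), Decimal("0"))

def apply_step(ledger, postings):
    """Apply one atomic transaction, rejecting anything unbalanced."""
    if net_balance(postings) != 0:
        raise ValueError("unbalanced transaction; halting before write-back")
    ledger.extend(postings)
    return ledger

ledger = []
# Step 1: a balanced expense passes the check and is written.
apply_step(ledger, [("Expenses:Food", Decimal("12.50")),
                    ("Assets:Checking", Decimal("-12.50"))])
# Step 2: a faulty LLM-proposed step is caught instead of compounding.
try:
    apply_step(ledger, [("Assets:Checking", Decimal("-5.00"))])
except ValueError as err:
    print(err)
```

The design point is that each verification resets the error budget: a three-step reconciliation built from guarded atomic steps fails loudly at the bad step rather than silently multiplying a 58%-per-step accuracy into the 30s.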

FinSim also suggests a concrete direction for Bean Labs: a Beancount-specific transaction simulator could generate labeled test cases for evaluating and fine-tuning models on ledger operations. The architecture is already there; the domain just needs to be ported.
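A FinSim-style generator for this domain could start very small: emit balanced two-posting transactions and inject one deliberate error class, labeling each case. The single "unbalanced" error type and all names below are my own illustration; FinSuite's actual taxonomy has 12 error types:

```python
# Minimal FinSim-style generator: labeled correct/erroneous ledger entries.
# The lone "unbalanced" error class is illustrative; a real harness would
# port a fuller error taxonomy.
import random
from decimal import Decimal

def make_transaction(rng, erroneous=False):
    """Return (postings, label); erroneous cases get a one-cent imbalance."""
    amount = Decimal(rng.randint(100, 9999)) / 100
    credit = -amount
    if erroneous:
        credit += Decimal("0.01")  # deliberate injected error
    postings = [("Expenses:Misc", amount), ("Assets:Checking", credit)]
    label = "error:unbalanced" if erroneous else "ok"
    return postings, label

rng = random.Random(42)  # seeded so test cases are reproducible
dataset = [make_transaction(rng, erroneous=rng.random() < 0.2)
           for _ in range(100)]
print(sum(1 for _postings, label in dataset if label != "ok"),
      "erroneous cases generated")
```

Because generation is seeded and unlimited, the same memorization-resistance argument the paper makes for FinSim applies: every evaluation run can use fresh, never-before-seen cases.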

Related reading
  • Financial Statement Analysis with Large Language Models (Alex Kim, Maximilian Muhn, Valeri Nikolaev; arXiv:2407.17866) — tests GPT-4's ability to predict earnings direction from financial statements, achieving parity with specialized ML models; a useful counterpoint to FinMaster's grim statement-generation numbers.
  • FinAuditing: A Financial Taxonomy-Structured Multi-Document Benchmark (arXiv:2510.08886) — more granular auditing evaluation with multi-document reasoning; complements FinMaster's sparse 35-task auditing coverage.
  • AuditBench: A Benchmark for Large Language Models in Financial Statement Auditing (Springer 2025) — pairs synthesized transaction data with real financial tables to test error detection and explanation; directly comparable methodology to FinMaster's auditing module.