FinBen: Benchmarking LLMs Across 36 Financial Tasks — Implications for Accounting AI

· 5 min read
Tian Pan
Research Engineer

FinBen landed at NeurIPS 2024 as the most comprehensive public evaluation of LLMs on financial tasks to date. I've been meaning to read it carefully: before designing any autonomous agent over Beancount ledgers, I need a realistic picture of where frontier models actually stand on the financial reasoning tasks such an agent would need to perform.

The paper

Qianqian Xie and 33 co-authors present FinBen, an open-source benchmark covering 36 datasets across 24 financial tasks, organized into seven dimensions: information extraction, textual analysis, question answering, text generation, risk management, forecasting, and decision-making. They evaluate 15 representative LLMs — including GPT-4, ChatGPT, Gemini, and several instruction-tuned open-source models — and introduce three new datasets for summarization, QA, and stock trading evaluation.

The central motivation is that prior financial benchmarks like FLUE and FLARE each captured a slice of financial NLP but nothing close to the full pipeline. FinBen is the first attempt to span the whole stack in one place, and it was accepted into the NeurIPS 2024 Datasets and Benchmarks Track, which gives it a reasonable stamp of methodological scrutiny.

Key ideas

  • On named entity recognition, GPT-4 scores 0.83 Entity F1 on the FINER-ORD dataset — strong, but this is the easiest category in the benchmark.
  • On FinQA (numerical reasoning over financial reports), GPT-4 reaches 0.63 Exact Match; on the conversational variant ConvFinQA, it scores 0.76. These are respectable but far from solved.
  • Domain-fine-tuned FinMA 7B achieves 0.88 F1 on FPB sentiment — outperforming GPT-4 on this narrow task, confirming that fine-tuning still earns you something on well-defined classification.
  • Stock movement forecasting is the clearest failure mode: even GPT-4 scores roughly 0.54 accuracy — barely above random. The authors call this "a notable deficiency in LLMs' capacity to tackle forecasting."
  • GPT-4 achieves a Sharpe Ratio of 1.51 on the trading task, versus 1.03 for Gemini, and a cumulative return of 28.19% against a buy-and-hold return of −4.00% over the evaluation period. But this is a short backtest with all the usual caveats.
  • All models scored zero on extractive summarization, and GPT-4 scored 0.01 F1 on relation extraction. Capabilities collapse sharply outside the comfort zone of text classification and open-ended generation.
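To make the trading numbers concrete, here is a minimal sketch of the two metrics involved, assuming a daily return series and the standard 252-trading-day annualization; FinBen's exact backtest conventions may differ.

```python
import math

def sharpe_ratio(daily_returns, risk_free_daily=0.0, periods=252):
    """Annualized Sharpe ratio from a series of daily returns."""
    excess = [r - risk_free_daily for r in daily_returns]
    mean = sum(excess) / len(excess)
    # Sample variance (n - 1 denominator), then annualize by sqrt(periods).
    var = sum((r - mean) ** 2 for r in excess) / (len(excess) - 1)
    return (mean / math.sqrt(var)) * math.sqrt(periods)

def cumulative_return(daily_returns):
    """Compound a daily return series into a total return."""
    total = 1.0
    for r in daily_returns:
        total *= 1 + r
    return total - 1
```

A Sharpe ratio computed this way over only a few months of daily observations carries a wide confidence interval, which is exactly the caveat above.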

What holds up — and what doesn't

The benchmark is genuinely useful as a survey instrument. The range of tasks is broader than anything that came before it, and the open-source release means others can build on the evaluation infrastructure rather than starting over.

That said, I have real concerns about what FinBen can actually tell you. The trading evaluation period is short and market-specific; a Sharpe Ratio computed over a few months on US equities is not a stable signal. The zero scores on extractive summarization tell us something is broken, but the paper doesn't diagnose why — is it a prompt format issue, a tokenization artifact, or a genuine reasoning failure? The distinction matters for anyone trying to fix it.

The benchmark is also almost entirely English and US-market-centric. This isn't just a generalization caveat; it means the results tell you very little about performance on, say, German or Chinese financial documents, or on jurisdictions with different accounting standards. For a project like Beancount.io serving a global user base, this is a significant gap.

The instruction-tuned model story is also muddier than it first appears. Fine-tuning helps on sentiment (FinMA 7B at 0.88) but "provides only marginal improvements for complex tasks like QA." The paper reports this as a finding but doesn't offer a mechanistic explanation. Is it catastrophic forgetting on the base model's reasoning ability? Is the fine-tuning data distribution too narrow? The benchmark surface area alone can't answer this.

Why this matters for finance AI

The FinBen results give Bean Labs a cleaner baseline than we had before. The tasks most relevant to a Beancount ledger agent — numerical QA over structured financial reports (FinQA: 0.63 Exact Match), information extraction from transaction descriptions (NER: 0.83 F1), and anomaly detection or fraud classification (risk management tasks showing wide variance) — are all represented here, and none of them is solved.

The forecasting collapse (0.54 on stock movement) is actually reassuring for our narrower use case: we're not asking models to predict markets; we're asking them to classify, extract, and write back structured entries. Those tasks land in the 0.63–0.83 range depending on complexity, which is a workable foundation — though "workable" is not "production-safe without human review."
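
One way to act on that caveat is a confidence gate: auto-apply only the high-confidence slice of model output and queue the rest for a human. The names below (`REVIEW_THRESHOLD`, `route`) are illustrations, not any real Beancount.io API, and the cutoff would need per-task calibration.

```python
REVIEW_THRESHOLD = 0.9  # hypothetical cutoff; tune per task and error cost

def route(prediction, confidence):
    """Route a model output either to auto-apply or to a human review queue.

    With FinBen-style task accuracies in the 0.63-0.83 range, only the
    high-confidence slice should ever bypass review.
    """
    if confidence >= REVIEW_THRESHOLD:
        return ("auto_apply", prediction)
    return ("human_review", prediction)
```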

The gap between structured extraction and open-ended reasoning also maps directly onto the write-back safety problem. If a model can reliably extract an entity (F1 0.83) but struggles to reason about its numerical implications (FinQA 0.63) or generate correct structured output (relation extraction: 0.01), then the safest architecture keeps those steps separate, with explicit validation between them.
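
A minimal sketch of what "explicit validation between them" could look like at the write-back step. The postings format and `validate_entry` are simplified illustrations, not the real Beancount data model: a single currency, and a balance check before anything touches the ledger.

```python
from decimal import Decimal

def validate_entry(postings, tolerance=Decimal("0.005")):
    """Check that a double-entry transaction balances before write-back.

    `postings` is a list of (account, amount) pairs in one currency,
    a simplified stand-in for a real Beancount transaction.
    """
    total = sum((amt for _, amt in postings), Decimal("0"))
    return abs(total) <= tolerance

# A model-extracted entry passes through validation before being written:
extracted = [
    ("Expenses:Groceries", Decimal("42.17")),
    ("Assets:Checking", Decimal("-42.17")),
]
assert validate_entry(extracted)  # balanced: safe to write back

corrupted = [
    ("Expenses:Groceries", Decimal("42.17")),
    ("Assets:Checking", Decimal("-24.17")),  # transposed digits from a bad extraction
]
assert not validate_entry(corrupted)  # rejected before it touches the ledger
```

Keeping extraction, validation, and write-back as separate stages means a 0.83-F1 extractor never gets to exercise the model's much weaker structured-output path unchecked.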

Related reading

  • FinMaster (arXiv:2505.13533) — explicitly benchmarks end-to-end accounting workflows including journal entry and reconciliation; closer to the Beancount task than anything in FinBen.
  • "Table Meets LLM: Can Large Language Models Understand Structured Table Data?" (arXiv:2305.13062, WSDM 2024) — Beancount ledgers are essentially structured tables; this paper benchmarks exactly the structural understanding capabilities that underlie any ledger-reading agent.
  • ReAct: Synergizing Reasoning and Acting in Language Models (arXiv:2210.03629) — the interleaved reasoning-and-action framework is what most write-back agents would use; understanding its failure modes matters more now that FinBen has shown where the reasoning floor actually is.