
ConvFinQA: Multi-Turn Financial QA and the 21-Point Gap Between Models and Human Experts

· 6 min read
Mike Thrift
Marketing Manager

After spending several posts on single-turn financial QA — FinQA, FinanceBench, TAT-QA — I wanted to look at what happens when users ask follow-up questions. ConvFinQA (Chen et al., EMNLP 2022) is the paper that takes the FinQA setting and extends it into multi-turn conversation, and the results expose a failure mode that single-turn benchmarks simply cannot see: models that ace isolated numerical reasoning frequently collapse the moment a question references something said two turns ago.

The paper


ConvFinQA, from Zhiyu Chen, Shiyang Li, Charese Smiley, Zhiqiang Ma, Sameena Shah, and William Yang Wang (UC Santa Barbara and J.P. Morgan), builds a dataset of 3,892 multi-turn conversations totaling 14,115 questions over 2,066 financial report pages. Each conversation is grounded in earnings reports — the same S&P 500 filings used in FinQA — and questions chain together so that later turns can reference earlier answers. The task format is inherited from FinQA: models generate a program in a small domain-specific language (add, subtract, multiply, divide, greater, exp) that is then executed to produce the answer. Evaluation uses execution accuracy (whether the executed result matches the gold answer) and program accuracy (whether the generated program matches the gold program).
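The DSL is small enough that the execution side of the task fits in a few lines. Here is a minimal interpreter sketch — not the authors' code — assuming the FinQA surface form seen in the paper's examples, where a program is a comma-separated list of steps and `#k` refers back to the result of step k:

```python
import operator

# The six ConvFinQA/FinQA operations, mapped to Python equivalents.
OPS = {
    "add": operator.add,
    "subtract": operator.sub,
    "multiply": operator.mul,
    "divide": operator.truediv,
    "greater": lambda a, b: a > b,
    "exp": operator.pow,
}

def execute(program: str) -> float:
    """Execute e.g. 'subtract(206588, 181001), divide(#0, 181001)'."""
    results = []
    for step in program.split("),"):
        op, args = step.strip().rstrip(")").split("(", 1)
        vals = []
        for arg in args.split(","):
            arg = arg.strip()
            # "#k" is a reference to an earlier step's result.
            vals.append(results[int(arg[1:])] if arg.startswith("#") else float(arg))
        results.append(OPS[op](*vals))
    return results[-1]

# A typical chained program: compute a change, then express it as a ratio.
growth = execute("subtract(206588, 181001), divide(#0, 181001)")
```

The model's whole job is to emit such a program; the executor is deterministic, which is what makes execution accuracy a clean metric.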

The dataset has two conversation types. Type I "simple" conversations decompose a single complex FinQA question into a sequence of sub-questions. Type II "hybrid" conversations concatenate decompositions of two different FinQA questions about the same report, forcing cross-aspect reasoning. Over 60% of questions have dependencies on prior turns, and second-part questions in hybrid conversations are substantially harder because the model must carry reasoning state across different financial topics.

Key ideas

  • Best fine-tuned model (FinQANet with RoBERTa-large): 68.90% execution accuracy on the test set. Human financial experts reach 89.44%. General crowd workers (MTurk): 46.90% — a striking gap that confirms the task requires genuine domain knowledge.
  • GPT-3 (text-davinci-002, 175B) with 20 few-shot exemplars and gold supporting facts: 50.30% execution accuracy — well below the fine-tuned specialist and barely above the crowd.
  • Chain-of-thought prompting hurts GPT-3: CoT yields 40.63% vs. 45.15% for standard program prompting. The model mimics the reasoning format of the given examples instead of applying it to the actual question.
  • Hybrid conversations are substantially harder: the second part of a hybrid conversation scores 52.38% for FinQANet versus 72.37% for simple conversations. Multi-aspect cross-referencing is where current models fall apart.
  • GPT-3 specifically struggles with number selection questions — answering a follow-up like "what about the prior year?" — achieving only 35.32% where FinQANet reaches 82.54%. Conversational anaphora resolution is the bottleneck.

What holds up — and what doesn't

The dataset construction is careful and the evaluation is clean. Using program accuracy alongside execution accuracy is valuable: two programs can produce the same numerical answer by different (possibly wrong) reasoning paths, and program accuracy catches that. The decision to anchor conversations in real S&P 500 filings keeps the task grounded rather than synthetic.
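The distinction is easy to demonstrate with a toy case: two programs that produce the same number via different reasoning paths. This is an illustrative sketch with a two-operation interpreter; plain string equality stands in here for the paper's program matching, which compares program structure rather than raw strings:

```python
# Gold reasoning: change = new - old, then ratio = change / old.
gold = "subtract(120, 100), divide(#0, 100)"
# Predicted: divide first, then subtract 1. Same number, different path.
pred = "divide(120, 100), subtract(#0, 1)"

def exec_prog(program: str) -> float:
    """Minimal FinQA-style interpreter (assumed format, two ops only)."""
    ops = {"subtract": lambda a, b: a - b, "divide": lambda a, b: a / b}
    results = []
    for step in program.split("),"):
        op, args = step.strip().rstrip(")").split("(", 1)
        vals = [results[int(a.strip()[1:])] if a.strip().startswith("#")
                else float(a) for a in args.split(",")]
        results.append(ops[op](*vals))
    return results[-1]

execution_match = abs(exec_prog(gold) - exec_prog(pred)) < 1e-6  # same answer
program_match = gold == pred                                     # different program
```

Here execution accuracy scores the prediction as correct while program accuracy flags it — in this toy case the two paths happen to be algebraically equivalent, but the same mechanism catches predictions that are right only by coincidence.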

That said, the conversation variety is limited by design. Every conversation is constructed by decomposing existing FinQA questions — there are no truly open-ended dialogues, no clarification turns, no user corrections. Real accounting conversations include all of these. The dataset is a controlled approximation of conversational reasoning, not a naturalistic sample.

The GPT-3 analysis has aged awkwardly. At the time of publication (late 2022), GPT-3 peaking below 50% felt like a meaningful negative result. But the paper predates GPT-4, and subsequent work shows that more capable models close much of the gap. The CoT finding — that prompting backfired — is interesting but may be model-specific: CoT tends to work better in models with stronger instruction following.

The evaluation also focuses entirely on final answer correctness and ignores intermediate reasoning chain quality. This matters because a model can generate a numerically correct answer via a wrong program (which program accuracy partially catches) or a correct program via brittle reasoning that would fail under slight paraphrasing. FinChain (2025) explicitly critiques this, motivating a transparency-focused alternative. For production systems, knowing why the model got the right answer is as important as knowing that it did.

Why this matters for finance AI

A Beancount agent fielding user queries rarely gets a single self-contained question. Users ask "what did I spend on groceries last month?" and then "how does that compare to the month before?" and then "is that more than I budgeted?" Each question builds on the last. ConvFinQA is the closest published benchmark to this interaction pattern, and its numbers are sobering: even with gold retrieval, the best available model in 2022 left a ~21 percentage-point gap to human expert performance, and the gap widens on multi-aspect questions.

The specific failure on hybrid conversations is worth flagging. When a user switches from asking about revenue to asking about expenses in the same session, the model needs to carry forward numerical context while resetting topical focus. That is exactly what a Beancount agent must do across a multi-turn ledger review session. The 52.38% score on those turns is a direct lower bound on how well current approaches handle that scenario.
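One way to make that requirement concrete is to model the dialogue state explicitly. The sketch below is hypothetical — the class and slot names are mine, not from any Beancount tooling — but it shows the minimum an agent must track: each answered turn records its topic and period, so a follow-up like "the month before" can carry the topic forward while shifting the period:

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    topic: str     # e.g. a ledger account like "Expenses:Groceries"
    period: str    # e.g. "2024-04"
    value: float   # the answer given for this turn

@dataclass
class DialogueState:
    history: list = field(default_factory=list)

    def record(self, topic: str, period: str, value: float) -> None:
        self.history.append(Turn(topic, period, value))

    def resolve(self, topic=None, period=None):
        """Fill any unspecified slot from the most recent turn.

        This is the anaphora step: "how does that compare to the month
        before?" specifies a new period but inherits the topic.
        """
        last = self.history[-1]
        return (topic or last.topic, period or last.period)

state = DialogueState()
state.record("Expenses:Groceries", "2024-04", 412.50)
# Follow-up: "how does that compare to the month before?"
topic, period = state.resolve(period="2024-03")
```

The hybrid-conversation failure mode maps onto the `resolve` step: when the user switches topic mid-session ("now what about dining out?"), the agent must overwrite the topic slot while still being able to reference earlier numeric answers in `history`.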

The CoT finding is also practically useful: it suggests that when prompting a model to reason over financial data in a multi-turn setting, structured program generation may be more reliable than free-form chain-of-thought, at least for models of GPT-3's capability level. More capable models may not show this inversion — but it is a hypothesis to test, not an assumption to make.

  • APOLLO (arXiv:2212.07249) — a ConvFinQA follow-up that achieves state-of-the-art using number-aware negative sampling and consistency-based reinforcement learning; worth reading to see what closed the gap after the original paper
  • Program of Thoughts Prompting (arXiv:2211.12588, 2022) — offloads arithmetic to a Python interpreter rather than a DSL; reported ~12% improvement over CoT on financial QA tasks and near-SoTA on ConvFinQA; connects CodeAct ideas directly to financial reasoning
  • FLARE: Active Retrieval Augmented Generation (arXiv:2305.06983, EMNLP 2023) — retrieves on-demand during generation rather than once at the start; directly relevant to the multi-turn setting where what the model needs to look up changes turn by turn