
Fin-RATE: How LLMs Fail at Cross-Period and Cross-Entity Financial Analysis

· 6 min read
Mike Thrift
Marketing Manager

Financial LLM benchmarks keep expanding in scope, and Fin-RATE is the clearest example yet of what happens when we finally ask models to do what real analysts do: track a company not just within a single filing, but across multiple periods and against its industry peers.

The paper


Fin-RATE, published in February 2026 by Yidong Jiang, Junrong Chen, and colleagues at Yale and collaborating institutions, introduces a benchmark built from 2,472 SEC filings across 43 companies and 36 industries spanning 2020–2025. The benchmark organizes 7,500 expert-curated QA pairs into three task types that mirror professional analyst workflows: DR-QA (detail and reasoning within a single filing), EC-QA (cross-entity comparison of two companies under a shared topic), and LT-QA (longitudinal tracking of the same firm across reporting periods). Each task type contains 2,500 questions. The evaluation spans 17 LLMs—closed-source models including GPT-4.1 and GPT-5, open-source general models like DeepSeek-V3 and Llama-3.3-70B, and finance-specialized models like Fin-R1, Fino1-14B, FinanceConnect-13B, and TouchstoneGPT-7B. Scoring uses a unified LLM-as-Judge framework with three independent judges (GPT-5, DeepSeek-V3.2, Qwen3-235B) rating each response on correctness and five analytic dimensions.
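
The paper's exact judging prompt and aggregation rule aren't reproduced here, but a three-judge scheme like the one described can be sketched as follows. The 1–5 dimension scale, the specific dimension names, and the majority-vote rule for correctness are assumptions for illustration, not the paper's confirmed setup:

```python
from statistics import mean

JUDGES = ["gpt-5", "deepseek-v3.2", "qwen3-235b"]
# Hypothetical analytic dimensions; the paper's five are not restated here.
DIMENSIONS = ["accuracy", "relevance", "completeness", "logic", "clarity"]

def aggregate(ratings: dict) -> dict:
    """Combine three independent judges' scores for one response.

    ratings maps judge name -> {"correct": bool, <dimension>: int, ...}.
    Correctness is decided by a 2-of-3 majority vote; each dimension
    score is the mean across judges.
    """
    correct_votes = sum(ratings[j]["correct"] for j in JUDGES)
    return {
        "correct": correct_votes >= 2,
        **{d: mean(ratings[j][d] for j in JUDGES) for d in DIMENSIONS},
    }

# Toy ratings for a single model response.
example = {
    "gpt-5":         {"correct": True,  "accuracy": 4, "relevance": 5, "completeness": 3, "logic": 4, "clarity": 4},
    "deepseek-v3.2": {"correct": True,  "accuracy": 4, "relevance": 4, "completeness": 3, "logic": 3, "clarity": 4},
    "qwen3-235b":    {"correct": False, "accuracy": 3, "relevance": 4, "completeness": 2, "logic": 3, "clarity": 3},
}
print(aggregate(example))
```

Using three judges means one disagreeing judge (as in the toy example) cannot flip the correctness verdict on its own, which is the bias-reduction argument the authors make.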

Key ideas

  • Performance collapses as task complexity grows: accuracy drops 18.60% from single-document DR-QA to longitudinal LT-QA and 14.35% from DR-QA to cross-entity EC-QA, averaged across all 17 models.
  • GPT-5 with web search is the top performer, yet its peak accuracy sits at only 43–44% across all three task types—dismal for a benchmark meant to mirror real analyst workflows.
  • Fin-R1, the finance-specialized reasoning model, reaches 57.48% on DR-QA but collapses to 3.32% on EC-QA—a 54-point fall that far exceeds any general model's degradation.
  • Under RAG settings, performance across all models falls well below 27%, compared to gold-context performance of up to 57.48%; the retrieval pipeline, not the LLM, is the binding bottleneck.
  • The paper introduces a 13-type error taxonomy across four categories: hallucination and contradictions, finance-specific numerical and semantic errors, query/context understanding errors, and retrieval-level failures. Missing Evidence accounts for 75.44% of errors on the EC-QA task under RAG.
  • Finance-specialized models show systematically higher hallucination rates than general models on complex tasks, despite better financial terminology.
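
The headline degradation figures are simple averages of per-model accuracy gaps across the 17 models. A sketch of that computation, with placeholder model names and scores (the real per-model table is in the paper, and whether the drops are absolute points or relative percentages is not restated here; the sketch uses points):

```python
from statistics import mean

# Placeholder per-model accuracies (%) on each task type.
scores = {
    "model-a": {"DR-QA": 55.0, "EC-QA": 40.0, "LT-QA": 35.0},
    "model-b": {"DR-QA": 45.0, "EC-QA": 32.0, "LT-QA": 28.0},
    "model-c": {"DR-QA": 40.0, "EC-QA": 25.0, "LT-QA": 22.0},
}

def mean_drop(scores: dict, src: str, dst: str) -> float:
    """Average accuracy drop (percentage points) from task src to task dst."""
    return mean(s[src] - s[dst] for s in scores.values())

print(f"DR-QA -> LT-QA: {mean_drop(scores, 'DR-QA', 'LT-QA'):.2f} pts")
print(f"DR-QA -> EC-QA: {mean_drop(scores, 'DR-QA', 'EC-QA'):.2f} pts")
```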

What holds up — and what doesn't

The three-pathway structure is genuinely well-designed. Most financial benchmarks (FinQA, TAT-QA, FinanceBench) treat QA as a single-document task. Fin-RATE is one of the first to explicitly model cross-entity comparison and longitudinal tracking as first-class tasks, and the results expose a fundamental gap: current LLMs handle isolated disclosure QA tolerably but fall apart the moment they need to synthesize across documents, entities, or time periods.

The Fin-R1 collapse is the paper's most striking finding, and I think it is underappreciated. A finance-tuned model that excels at single-document extraction apparently trained itself into a corner: it learned templates for answering within one document, not reasoning strategies for relating entities and time periods. This is a concrete warning against narrow domain fine-tuning without explicit multi-document reasoning supervision. The model likely overfit to the shallow pattern of "find the number in the filing" and never learned a generalization path to "compare this number to the equivalent number in another filing from another company."

That said, there are methodology concerns worth flagging. GPT-5 is simultaneously one of the models being evaluated and one of the three judges scoring answers. The authors use three judges to reduce individual bias, which helps, but the judge-model overlap with the strongest evaluated model is uncomfortable. The paper reports high inter-judge agreement but does not separately quantify what fraction of GPT-5 responses GPT-5 itself scored, nor whether GPT-5's self-assessed scores differ systematically from the other two judges. Any self-evaluation bias would inflate the top-line result for the best-performing model in the study.
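
The missing diagnostic would be straightforward to run if the per-judge score log were released. A sketch of the check, where the record fields (`model`, `judge`, `score`) are my assumptions about how such a log might be structured:

```python
from statistics import mean

def self_eval_gap(records: list) -> tuple:
    """Compare GPT-5's scores on GPT-5 responses against the other
    judges' scores on the same responses.

    records: list of dicts with keys 'model', 'judge', 'score'.
    Returns (self_mean, peer_mean); a positive self - peer gap would
    suggest self-evaluation bias inflating the top-line result.
    """
    self_scores = [r["score"] for r in records
                   if r["model"] == "gpt-5" and r["judge"] == "gpt-5"]
    peer_scores = [r["score"] for r in records
                   if r["model"] == "gpt-5" and r["judge"] != "gpt-5"]
    return mean(self_scores), mean(peer_scores)

# Toy log in which GPT-5 rates its own answers slightly above its peers.
log = [
    {"model": "gpt-5", "judge": "gpt-5",         "score": 4.5},
    {"model": "gpt-5", "judge": "deepseek-v3.2", "score": 4.0},
    {"model": "gpt-5", "judge": "qwen3-235b",    "score": 3.8},
    {"model": "gpt-5", "judge": "gpt-5",         "score": 4.2},
    {"model": "gpt-5", "judge": "deepseek-v3.2", "score": 4.1},
    {"model": "gpt-5", "judge": "qwen3-235b",    "score": 3.9},
]
self_mean, peer_mean = self_eval_gap(log)
print(f"self: {self_mean:.2f}, peers: {peer_mean:.2f}, gap: {self_mean - peer_mean:+.2f}")
```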

The 43-company sample is also thin. The filing type coverage is commendably broad (10-K, 10-Q, 8-K, 6-K, DEF 14A, and several S and SC series), but the same 43 companies appear across all tasks. Models that have seen these companies' disclosures in pre-training have an unquantified advantage, and the paper does not include any contamination analysis.

The retrieval finding is important but incomplete. The paper identifies that RAG performance collapses by roughly 30 points versus gold context because retrieval fails. But it benchmarks only a single retrieval setup—it treats retrieval failure as a diagnosis rather than something to systematically vary. A follow-on paper that sweeps retrieval architectures on Fin-RATE would be far more actionable.

Why this matters for finance AI

Beancount ledger audit needs exactly the two capabilities Fin-RATE reveals are broken: longitudinal tracking (how did this account evolve over fiscal years?) and cross-entity comparison (does this subsidiary's balance sheet reconcile against the consolidated statement?). The 18.60% accuracy drop under temporal tracking is a concrete number that should calibrate expectations for any Beancount agent reasoning across multiple reporting periods. If frontier models fail at 43% under gold-context longitudinal SEC QA, a Beancount agent navigating multi-year ledger histories should be designed with explicit retrieval, temporal grounding, and human escalation—not end-to-end LLM inference.

The retrieval dominance finding matters most for system design priority. If gold-context performance is more than double RAG performance, the right investment is in better chunking, passage selection, and retrieval—not a more capable backbone LLM. This mirrors what DocFinQA found for long-context SEC filings: the pipeline around the model is the bottleneck.
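
One concrete reason chunking dominates: SEC filings interleave narrative and tables, and a fixed-size splitter routinely severs a figure from the sentence or table row that gives it meaning. A minimal sliding-window chunker (the sizes and overlap below are arbitrary choices for illustration, not the paper's retrieval setup) shows the knob worth sweeping:

```python
def chunk(text: str, size: int = 800, overlap: int = 200) -> list:
    """Split text into fixed-size character windows with overlap.

    Overlap reduces the chance that a number is separated from
    its surrounding context at a chunk boundary.
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

# A toy "filing": the chunk count (and thus retrieval granularity)
# changes substantially with window size.
filing = "Revenue for fiscal 2024 was $4.2 billion, up 12% year over year. " * 50
for size in (400, 800, 1600):
    print(size, len(chunk(filing, size=size)))
```

Sweeping `size` and `overlap` (plus passage selection and reranking) on Fin-RATE's RAG setting is exactly the follow-on experiment the single-pipeline evaluation leaves open.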

The Fin-R1 warning also applies directly to the Beancount use case. Fine-tuning on Beancount DSL syntax and transaction patterns may produce a model that handles simple entry generation well but breaks under the multi-account, multi-period reconciliation that makes audit useful. Specialization without multi-document reasoning training is fragile in exactly the ways Fin-RATE measures.

Further reading

  • Fin-R1 (arXiv:2503.16252) — to understand what training setup produced such brittle cross-document performance, and whether multi-document reasoning was ever in scope.
  • FinTrace (arXiv:2604.10015) — trajectory-level evaluation of LLM tool calling across 34 financial task categories; complements Fin-RATE's static QA view with a process-level diagnostic of where models invoke the right tools but fail to reason over the results.
  • OpenHands (arXiv:2407.16741) — the open agent platform underlying TheAgentCompany evaluations; understanding its architecture clarifies which baseline agent capabilities were available and which gaps are attributable to task difficulty rather than platform limitations.