FinTrace: Trajectory-Level Evaluation of LLM Tool Calling for Financial Tasks
FinTrace (arXiv:2604.10015) arrives one week after FinToolBench, which I logged last time, and the two papers are in direct conversation with each other. Where FinToolBench measures whether an agent calls the right tools, FinTrace asks the harder question: even when an agent calls the right tools, does it actually reason over the results? That distinction is the crux of the paper and, I think, the crux of the entire Beancount write-back agent problem.
The paper
Cao et al. introduce FinTrace, a benchmark of 800 expert-annotated trajectories spanning 34 real-world financial task categories across easy, medium, and hard difficulty tiers. The authors construct their evaluation around a rubric of nine metrics organized along four axes: action correctness (tool-calling F1, task relevance), execution efficiency (step efficiency, redundancy score), process quality (logical progression, information utilization, progress score), and output quality (task pass rate, final answer quality). They evaluate 13 LLMs and also release FinTrace-Training, a dataset of 8,196 curated preference trajectories for fine-tuning.
The central claim is that frontier models have mastered tool selection but systematically fail at the harder step: using what the tools return. The benchmark probes this with a 5-point scale for information utilization, logical progression, and progress score, plus algorithmic metrics for tool F1 and step efficiency.
Key ideas
- The best-performing model, Claude-Opus-4.6, achieves a Tool-Calling F1 of 0.896 (strong selection) but scores only 3.23/5 on Information Utilization, a process-quality metric on the judged 1–5 scale.
- Claude-Opus-4.6's Task Pass Rate is 2.65/5, and Final Answer Quality is 3.34/5; even the top model does not consistently produce correct, complete answers.
- Qwen-3.5-9B exhibits a degenerate pattern: near-perfect Step Efficiency (1.000) and Redundancy (1.000) because it barely calls any tools, reflected in a Tool-Calling F1 of 0.109. Efficient but useless.
- Training on FinTrace-Training improves intermediate process metrics (Logical Progression rises from 2.29 to 2.56 with DPO; Progress Score from 2.00 to 2.30), but Final Answer Quality stays bottlenecked: no small-model variant significantly exceeds an average of 1.21 on the 1–5 scale.
- DPO outperforms SFT at suppressing catastrophic failure modes: the share of Logical Progression scores of 1 drops from 11.9% (SFT) to 9.5% (DPO).
- The universally worst sub-category across all 13 models is Reasoning QA, where Claude-Opus-4.6 achieves only 0.62 overall — a hard ceiling shared even by the strongest frontier model.
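The Tool-Calling F1 numbers above are easiest to read with the metric in front of you. A minimal sketch, assuming F1 is computed over sets of tool names per trajectory (my simplification; the paper may also match arguments and call order):

```python
def tool_f1(gold: set[str], called: set[str]) -> float:
    """Set-level F1 between the annotated gold tool calls and the
    model's actual calls. Empty-vs-empty counts as a perfect match."""
    if not gold and not called:
        return 1.0
    tp = len(gold & called)  # tools the model called that it should have
    if tp == 0:
        return 0.0
    precision = tp / len(called)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

# Hypothetical trajectory: one correct tool swapped for an irrelevant one.
gold = {"get_balance", "get_price", "compute_return"}
called = {"get_balance", "get_price", "get_news"}
print(round(tool_f1(gold, called), 3))  # 0.667
```

Under this reading, Qwen-3.5-9B's 0.109 means it almost never calls the gold tools at all, which is exactly why its perfect efficiency scores are vacuous.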
What holds up — and what doesn't
The core finding — that tool selection and tool reasoning are dissociable — is well motivated and the four-axis rubric is a genuine contribution. Prior benchmarks like FinToolBench stop at execution traces; FinTrace adds LLM-judged process quality metrics that expose what happens in between. The inter-rater Cohen's κ of 0.89 on 100-sample validation is encouraging for a benchmark built partly on LLM judges.
That said, several methodological choices limit what I can take from the numbers at face value. The 34 task categories are not enumerated in the main paper — they're deferred to Appendix B — so I can't tell how representative they are of real-world financial practice. The difficulty tiers are defined by percentile ranks within the benchmark's own query pool, which is a circular measure: "hard" just means unusual relative to the other 800 trajectories, not hard in any absolute sense.
The fine-tuning analysis is frustrating. Training a 9B model on FinTrace-Training improves intermediate reasoning but final answer quality stays broken. The paper attributes this to a "disconnect" between process and output, but doesn't explain why. The most plausible explanation — that a 9B model lacks the factual recall and arithmetic capability needed for finance tasks regardless of trajectory quality — is left unaddressed. Showing DPO results only for Qwen-3.5-9B also makes it impossible to know whether larger models benefit more.
I'm also skeptical of the overall score aggregation. Combining algorithmic metrics (F1 ∈ [0,1]) with LLM-judged scores on 1–5 Likert scales by normalizing to [0,1] and averaging conflates very different failure types. A model that calls the wrong tools entirely is not the same kind of broken as a model that calls the right tools and then ignores the output.
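The conflation is easy to see with toy numbers (hypothetical values and a guessed aggregation scheme, not the paper's actual code):

```python
def normalize(score: float, lo: float, hi: float) -> float:
    """Map a raw metric onto [0, 1]."""
    return (score - lo) / (hi - lo)

def aggregate(m: dict) -> float:
    # F1 already lives in [0, 1]; the Likert score is rescaled from 1-5,
    # then the two are averaged, as I read the paper's overall score.
    return (m["tool_f1"] + normalize(m["info_util"], 1, 5)) / 2

# Model A: calls the wrong tools, but reasons well over what little it gets.
model_a = {"tool_f1": 0.10, "info_util": 5.0}
# Model B: calls the right tools, then largely ignores their output.
model_b = {"tool_f1": 0.90, "info_util": 1.8}

print(aggregate(model_a))  # 0.55
print(aggregate(model_b))  # 0.55
```

Two models with opposite failure modes land on the identical overall score, which is precisely why I'd read the per-axis numbers and discount the aggregate.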
Why this matters for finance AI
The core finding maps directly onto the Beancount write-back problem. An agent that reliably calls the right Beancount CLI tools but then misinterprets the output — say, parsing a balance sheet response and posting to the wrong account — is worse than no automation: it produces confidently wrong ledger entries that look correct to a casual reviewer.
The Information Utilization metric is the one I'd watch most carefully for any Beancount agent. That the best available model scores 3.23/5 on it in a controlled financial benchmark should act as a binding constraint on any production deployment. It argues for mandatory human review of every write-back operation, at least until that score sits consistently above 4.0.
FinTrace also confirms what ReDAct suggested last week: the right architecture is not end-to-end LLM reasoning but a pipeline that externalizes verification. An agent that selects tools well (Tool F1 ~0.9) and then passes results to a separate validation step before acting is more defensible than one that tries to reason over raw tool output in a single pass.
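For the write-back agent, that pipeline shape could look something like this. A minimal sketch under my own assumptions: the entry schema, function names, account list, and the specific checks are all illustrative, not anything FinTrace or Beancount prescribes:

```python
from dataclasses import dataclass

@dataclass
class ProposedEntry:
    date: str       # ISO date, e.g. "2026-04-12"
    account: str    # Beancount-style account name
    amount: float   # signed, in ledger currency
    narration: str

# Illustrative chart of accounts; a real agent would read it from the ledger.
KNOWN_ACCOUNTS = {"Assets:Checking", "Expenses:Groceries", "Equity:Opening"}

def validate(entry: ProposedEntry) -> list[str]:
    """Deterministic checks between the LLM and the ledger.

    The LLM proposes; this layer verifies; a human reviews anything flagged.
    """
    problems = []
    if entry.account not in KNOWN_ACCOUNTS:
        problems.append(f"unknown account: {entry.account}")
    if entry.amount == 0:
        problems.append("zero-amount posting")
    if not entry.date[:4].isdigit():
        problems.append(f"malformed date: {entry.date}")
    return problems

def write_back(entry: ProposedEntry) -> str:
    issues = validate(entry)
    if issues:
        # Route to mandatory human review instead of writing.
        return "HOLD: " + "; ".join(issues)
    return "OK: queued for ledger append"

print(write_back(ProposedEntry("2026-04-12", "Expenses:Groceries", 42.50, "market")))
print(write_back(ProposedEntry("2026-04-12", "Expenses:Grocery", 42.50, "typo")))
```

The point is not the specific checks but the seam: the agent's raw tool output never touches the ledger without passing a verification layer the LLM cannot talk its way around.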
What to read next
- FinMCP-Bench (arXiv:2603.24943): the companion paper using MCP as the tool interface standard, next on the reading list — directly comparable to FinTrace but built on a different protocol layer
- "Benchmarking LLM Tool-Use in the Wild" (arXiv:2604.06185): appeared simultaneously and evaluates tool calling outside finance; would clarify whether the information-utilization gap is domain-specific or general
- "Data-Driven Function Calling Improvements in Large Language Model for Online Financial QA" (arXiv:2604.05387): targets the same tool-calling failure modes from a training-data perspective and may explain what FinTrace-Training's DPO is missing
