Toolformer: Self-Supervised Tool Use and Its Limits for Finance AI

6 min read
Tian Pan
Research Engineer

Toolformer (Schick et al., 2023, Meta AI) is the foundational paper for teaching language models to call external APIs through self-supervised training. I've been putting off a careful reading because "tool use" has become such a buzzword that the original claims get muddied. Before designing any write-back agent that calls ledger tools, I need to understand what Toolformer actually demonstrated — and where it quietly fails.

The paper

Toolformer: Language Models Can Teach Themselves to Use Tools (arXiv:2302.04761)

Timo Schick and seven co-authors at Meta AI present a method for training a language model to decide when to call external APIs, what arguments to pass, and how to incorporate results into its own predictions — without requiring manually labeled training data for each tool. The approach is self-supervised: the model generates candidate API calls at plausible positions in text, executes those calls, and keeps only the examples where the API result genuinely reduces the model's perplexity on the surrounding tokens. That filtered dataset is then used for fine-tuning. Five tools are tested: a calculator, a BM25-based Wikipedia search engine, a question-answering model, a machine-translation system, and a calendar.
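Concretely, the paper represents API calls inline in the training text, roughly as `[Calculator(400 / 1400)→ 0.29]`. A minimal sketch of executing such an embedded calculator call — the regex, the ASCII `->` arrow, and the rounding are my illustrative choices, not the paper's implementation:

```python
import re

def execute_calculator_calls(text: str) -> str:
    """Replace unexecuted [Calculator(expr)] markers with
    [Calculator(expr)-> result], mimicking the paper's inline format."""
    def run(match: re.Match) -> str:
        expr = match.group(1)
        # Only evaluate strings that are plainly arithmetic; leave
        # anything else untouched rather than eval-ing arbitrary input.
        if not re.fullmatch(r"[\d\s+\-*/().]+", expr):
            return match.group(0)
        return f"[Calculator({expr})-> {round(eval(expr), 2)}]"

    return re.sub(r"\[Calculator\(([^)]*)\)\]", run, text)

sample = "Out of 1400 participants, 400 [Calculator(400 / 1400)] passed."
print(execute_calculator_calls(sample))
# -> Out of 1400 participants, 400 [Calculator(400 / 1400)-> 0.29] passed.
```

During training the result string is what follows the arrow; at inference the model emits the call, decoding pauses while the tool runs, and generation resumes with the result in context.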

The trained model is a 6.7B-parameter GPT-J-based model they call Toolformer. The paper was accepted at NeurIPS 2023.

Key ideas

  • On math word problems (SVAMP), Toolformer 6.7B scores 29.4% — compared to GPT-J baseline at 5.2%, OPT 66B at 4.9%, and GPT-3 175B at 10.0%. Tool use effectively collapses the usual scaling curve for arithmetic.
  • On ASDiv math, Toolformer reaches 40.4% vs GPT-J at 7.5% and GPT-3 at 14.0%; on MAWPS, 44.0% vs GPT-J at 9.9% and GPT-3 at 19.8%.
  • On factual QA tasks the picture reverses: GPT-3 still outperforms Toolformer on all three QA benchmarks (TriviaQA, WebQuestions, Natural Questions) despite Toolformer using search tools. Toolformer TriviaQA: 53.5% vs GPT-J baseline 31.9%, but GPT-3 without tools is higher still.
  • The self-supervised data-generation pipeline produces training examples where the model learns not to call an API when it isn't helpful — the filtering step uses perplexity improvement as the signal for "did this tool call actually help?"
  • Tool-use capability only emerges at scale: models below roughly 775M parameters do not reliably learn when to invoke tools, even with the same training signal.
  • The calendar tool is called only 0.2% of the time in temporal reasoning tasks; the model predominantly routes temporal questions to the wiki search tool instead.

What holds up — and what doesn't

The core insight is durable: the perplexity-based filtering trick is elegant because it requires no human labeling and no oracle that knows the right answer — only whether the inserted API result made the surrounding text more predictable. That's a genuine contribution, and the math results are striking. A 6.7B model beating GPT-3 on ASDiv isn't a trick of evaluation; it's a clean demonstration that, on arithmetic tasks, the right tool call substitutes for roughly a 26× parameter advantage.
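The filtering rule itself fits in a few lines. Paraphrasing the paper's criterion (the threshold `tau` here is illustrative, not their tuned value): a sampled call survives only if prefixing the call *and its result* lowers the model's loss on the following tokens by a margin, relative to the better of (a) no call at all and (b) the call without its result:

```python
def keep_api_call(loss_with_result: float,
                  loss_without_call: float,
                  loss_call_no_result: float,
                  tau: float = 1.0) -> bool:
    """Toolformer-style filter: keep a sampled API call only if the
    call-plus-result prefix reduces the LM's loss on subsequent tokens
    by at least tau versus the best competing baseline."""
    baseline = min(loss_without_call, loss_call_no_result)
    return baseline - loss_with_result >= tau

# A call whose result helps a lot survives; a marginal one is dropped.
keep_api_call(2.1, 3.6, 3.4)  # large loss reduction -> kept
keep_api_call(3.3, 3.6, 3.4)  # tiny loss reduction -> filtered out
```

Comparing against the call-without-result baseline is the subtle part: it prevents the model from being rewarded merely for the presence of call syntax rather than for the information the tool returned.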

What I find less convincing is the QA story. The paper presents Toolformer as broadly improving performance, but the QA results show it doesn't beat GPT-3 — a much larger model without any tools. The authors acknowledge this, but the narrative framing ("often competitive with much larger models") understates how selective the victory is: the model wins on tasks that cleanly decompose into a single calculator or lookup call, and loses or matches on tasks requiring genuine reasoning over retrieved content.

The deeper methodological issue is that the self-supervised pipeline assumes the model is already good enough to generate plausible API calls before it's been trained to do so. This is a bootstrapping problem. For well-structured tools like a calculator with a clear input format, it works. For tools with more complex argument schemas — exactly the kind you'd want for a real-world ledger write-back API — the quality of sampled calls would degrade fast.
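For intuition, the position-sampling half of that bootstrap looks roughly like this. In the paper, the probabilities come from a single forward pass of the LM (the probability of emitting the API-call start token next); the threshold and cap below are illustrative values, and the function takes the probabilities as a plain list:

```python
def candidate_positions(api_start_probs: list[float],
                        tau_s: float = 0.05,
                        top_k: int = 5) -> list[int]:
    """Keep up to top_k text positions where the model assigns the
    API-call start token ('[') a next-token probability above tau_s.
    Candidate calls are then sampled at each surviving position."""
    ranked = sorted(
        (i for i, p in enumerate(api_start_probs) if p > tau_s),
        key=lambda i: api_start_probs[i],
        reverse=True,
    )
    return sorted(ranked[:top_k])

candidate_positions([0.01, 0.2, 0.03, 0.6, 0.04, 0.07])  # -> [1, 3, 5]
```

Everything downstream of this step depends on the untrained model already placing mass on plausible call sites and plausible arguments — which is exactly why the pipeline degrades as argument schemas get more complex.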

The paper also evaluates each tool in isolation, not in combination. There's no demonstration of a multi-step pipeline where, say, a search result feeds into a calculator. The authors flag this as a limitation, but it's a significant one: real accounting workflows almost always require chained tool calls.

Finally, the evaluation is zero-shot. There's no comparison to few-shot prompted GPT-3 or GPT-4 with tools provided in context, which became the dominant paradigm within months of this paper. The experiments were run in early 2023, before function-calling APIs were widely adopted, so the comparison set was already somewhat dated by the time of the NeurIPS 2023 publication.

Why this matters for finance AI

The Toolformer paper answers a question I care about for Bean Labs: can a model learn to call a write-back API reliably, and at what cost? The answer from the math results is "yes, if the tool interface is clean and the task decomposes into a single call." The failure modes, though, map directly onto the hardest parts of the ledger problem.

Beancount write-back actions — classifying a transaction, inferring account mappings, generating a journal entry — are not single-step calculator calls. They involve retrieving context (prior entries, chart of accounts), applying rules (posting rules, currency constraints), and producing structured output that must be syntactically valid. That's at least three chained tool calls, and the Toolformer architecture explicitly cannot chain tools. The perplexity-based training signal would also be hard to apply here: it's not clear what "lower perplexity on the surrounding ledger text" means when the output is a structured .beancount file rather than natural language continuation.
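To make the mismatch concrete, here is the minimal shape of that chained pipeline. The tool names, signatures, and stubbed results are hypothetical illustrations, not a real Bean Labs API — the point is only that three dependent calls are required where Toolformer's training setup expresses one:

```python
from dataclasses import dataclass

@dataclass
class Txn:
    date: str
    payee: str
    amount: str  # e.g. "-12.00 USD"

def retrieve_context(txn: Txn) -> list[str]:
    """Step 1 (stubbed): fetch prior entries / chart of accounts."""
    return ["Expenses:Food:Restaurants"]

def apply_posting_rules(txn: Txn, accounts: list[str]) -> str:
    """Step 2 (stubbed): pick the target account under posting rules."""
    return accounts[0]

def render_entry(txn: Txn, account: str) -> str:
    """Step 3: emit a syntactically valid two-posting journal entry."""
    return (
        f'{txn.date} * "{txn.payee}"\n'
        f"  {account}  {txn.amount.lstrip('-')}\n"
        f"  Assets:Checking  {txn.amount}\n"
    )

def write_back(txn: Txn) -> str:
    """Three chained calls — the shape Toolformer's single-call
    training setup cannot express."""
    accounts = retrieve_context(txn)
    account = apply_posting_rules(txn, accounts)
    return render_entry(txn, account)

print(write_back(Txn("2026-04-16", "Cafe", "-12.00 USD")))
```

Step 2's input is step 1's output, so there is no single position in the surrounding text where a lone API result could be inserted and scored by perplexity — the paper's training signal has nothing to attach to.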

The more useful lesson from Toolformer for our purposes is the negative space: a write-back agent can't just be a fine-tuned LM that has memorized when to call the ledger API. It needs an explicit reasoning layer (ReAct or similar) that can plan, execute, and check intermediate results before committing a write. Toolformer demonstrates tool use works; it doesn't demonstrate it works safely on structured, side-effecting operations.
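A minimal sketch of that "check before committing" guard, with a toy validator standing in for a real beancount parser (the validation logic here is deliberately naive and purely illustrative):

```python
import re

def naive_validate(entry: str) -> tuple[bool, list[str]]:
    """Toy stand-in for a real parser: require a dated transaction
    line followed by at least two indented postings."""
    lines = entry.strip().splitlines()
    has_date = bool(re.match(r"\d{4}-\d{2}-\d{2} ", lines[0]))
    postings = [ln for ln in lines[1:] if ln.startswith("  ")]
    ok = has_date and len(postings) >= 2
    return ok, ([] if ok else ["malformed entry"])

def checked_write(entry: str, ledger: list, validate=naive_validate):
    """Validate the rendered entry, then commit; never perform the
    side-effecting write on a failed check."""
    ok, errors = validate(entry)
    if not ok:
        return False, errors
    ledger.append(entry)  # the commit, gated on the check above
    return True, []
```

This is the property Toolformer's setup never tests: its tools are all read-only, so a bad call costs nothing. A write-back agent needs the check to sit between the model's output and the side effect.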

  • ReAct: Synergizing Reasoning and Acting in Language Models (arXiv:2210.03629) — adds explicit chain-of-thought reasoning steps interleaved with tool calls; the architecture that addresses Toolformer's chaining limitation and is the basis for most modern agents.
  • ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs (arXiv:2307.16789, ICLR 2024) — scales tool use to over 16,000 real APIs via the ToolBench dataset; the closest thing to a stress test of tool calling at the complexity level a real accounting agent would face.
  • FinMaster (arXiv:2505.13533) — benchmarks end-to-end accounting workflows including journal entry and reconciliation; will show whether the gains Toolformer demonstrated on arithmetic generalize to the multi-step, schema-constrained tasks that matter for Beancount.