Gorilla: How Retrieval-Aware Training Reduces LLM API Hallucinations from 78% to 11%

· 6 min read
Mike Thrift
Marketing Manager

Reading the Gorilla paper (Patil et al., 2023, arXiv:2305.15334, NeurIPS 2024) because it sits at the junction of two problems I keep running into: how do we get an LLM agent to call the right tool with the right arguments, and how do we keep that ability alive as APIs change? The answers here are practical and the numbers are surprisingly strong — but the assumptions baked into the evaluation deserve more scrutiny than they usually get.

The paper

Gorilla, by Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez at UC Berkeley, addresses a concrete failure mode: state-of-the-art LLMs hallucinate API calls. When asked to write code that invokes a specific library function, GPT-4 (as of mid-2023) frequently generates plausible-looking but wrong function signatures, non-existent models, or deprecated argument names. Gorilla is a 7-billion-parameter LLaMA-based model fine-tuned specifically to generate accurate API calls, trained with a technique the authors call Retriever-Aware Training (RAT). The idea is simple: during training, the model is shown retrieved API documentation alongside the user query, formatted as "Use this API documentation for reference: <retrieved_API_doc_JSON>". This teaches the model both to read documentation and to trust retrieved context over its parametric memory — a property that pays dividends at inference time when documentation has changed.
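The prompt assembly is simple enough to sketch. In this minimal illustration, only the instruction string "Use this API documentation for reference:" comes from the paper; the function and field names are mine:

```python
import json

def build_rat_prompt(user_query: str, retrieved_doc: dict) -> str:
    """Assemble a retriever-aware training prompt: the user query plus the
    retrieved API documentation serialized as JSON, joined by the instruction
    string quoted in the Gorilla paper."""
    doc_json = json.dumps(retrieved_doc)
    return (
        f"{user_query}\n"
        f"Use this API documentation for reference: {doc_json}"
    )

# Illustrative doc record; real APIBench entries carry more fields.
prompt = build_rat_prompt(
    "Load a model that can classify images of plants.",
    {"api_name": "torch.hub.load", "arguments": ["repo_or_dir", "model"]},
)
```

At inference time the same template is filled with freshly retrieved documentation, which is what lets the model track API changes without retraining.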

The evaluation dataset, APIBench, covers 925 HuggingFace Model Hub APIs, 95 TorchHub APIs, and 696 TensorFlow Hub APIs, with ten synthetic instruction-following queries generated per API via self-instruct. The evaluation metric is AST sub-tree matching — the generated API call is parsed and checked for functional correctness — which also enables, for the first time in this setting, a principled measurement of hallucination rate.
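The matching idea can be illustrated with Python's `ast` module. This is my sketch, not the paper's implementation: a generated call counts as correct if it invokes the same dotted function name and supplies at least the reference call's keyword arguments with matching values, so hallucinated argument names fail while harmless extra arguments pass:

```python
import ast

def call_matches(generated: str, reference: str) -> bool:
    """Sketch of AST sub-tree matching: the generated call must use the same
    function and contain the reference's keyword arguments as a subset."""
    gen = ast.parse(generated, mode="eval").body
    ref = ast.parse(reference, mode="eval").body
    if not (isinstance(gen, ast.Call) and isinstance(ref, ast.Call)):
        return False
    # Compare the dotted function name (e.g. torch.hub.load) structurally.
    if ast.dump(gen.func) != ast.dump(ref.func):
        return False
    gen_kwargs = {kw.arg: ast.dump(kw.value) for kw in gen.keywords}
    return all(
        kw.arg in gen_kwargs and gen_kwargs[kw.arg] == ast.dump(kw.value)
        for kw in ref.keywords
    )

# Extra benign arguments pass; a wrong model name or argument fails.
print(call_matches(
    "torch.hub.load(repo_or_dir='pytorch/vision', model='resnet50', pretrained=True)",
    "torch.hub.load(repo_or_dir='pytorch/vision', model='resnet50')",
))  # True
```

String matching would reject the `pretrained=True` variant above even though it is functionally correct, which is exactly the failure the AST approach avoids.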

Key ideas

  • RAT makes documentation readable at inference time. By training on prompts that include retrieved documentation, Gorilla learns to defer to the retrieved text rather than recall API details from weights. This means the model stays current as APIs evolve without retraining.
  • Zero-shot accuracy: Gorilla 59–84%, GPT-4 18–39%. On TorchHub, Gorilla achieves 59.13% versus GPT-4's 38.70%. On HuggingFace, it's 71.68% vs 19.80%. On TensorFlow Hub, 83.79% vs 18.20%. The margin is largest where the API space is most diverse.
  • Hallucination reduction is the headline. Gorilla's hallucination rate is 6.98% on TorchHub, 10.95% on HuggingFace, and 5.40% on TensorFlow Hub. GPT-4's rates range from 36.55% to 78.65% on the same datasets.
  • The oracle retriever is the ceiling. With the ground-truth document retrieved (oracle mode), accuracy reaches 67–94%. This is the theoretical best case for any RAG-based system and the gap from zero-shot Gorilla to this ceiling is the room available for retriever improvement.
  • Real retrievers fall short. Switching from the oracle to GPT-Index at evaluation time degrades accuracy by 29.20%; BM25 degrades it by 52.27%. The model's robustness to retrieval noise is real, but not unlimited.
  • AST evaluation generalizes. The sub-tree matching approach measures whether the generated call is functionally correct, not just syntactically similar. This is the right metric for any task where the output is code that will actually execute.

What holds up — and what doesn't

The core claim holds: fine-tuning on documentation-augmented prompts dramatically improves API call accuracy and cuts hallucination. The AST evaluation methodology is genuinely novel and clearly better than string matching or human evaluation at scale. RAT is a clean, reproducible idea.

What I'm skeptical about is the scope of the benchmark. All three datasets — HuggingFace, TorchHub, TensorFlow Hub — are ML model registries with a very regular API structure: you load a model by name, possibly with a few keyword arguments, and call a predict-like method. The instructions are synthetically generated, which means the test distribution is closely related to the training distribution. A model fine-tuned on self-instruct data over ML API documentation, then evaluated on self-instruct queries for the same kind of ML APIs, is never tested on the hardness that actually shows up in production: ambiguous requests, multi-step workflows, argument type coercion, authentication, rate limits, or error recovery.

The retrieval degradation is also bigger than the paper's framing suggests. A 52% accuracy drop with BM25 retrieval is catastrophic. If the retriever you deploy in production resembles BM25 more than an oracle, the gains evaporate. The authors acknowledge this gap but don't offer a path to close it.

Finally, the model itself is a 7B LLaMA fine-tune. The comparison to zero-shot GPT-4 is striking, but it isn't quite fair: GPT-4 was not trained to use retrieved documentation. A RAG-augmented GPT-4 with a system prompt designed for reading API docs would almost certainly close much of the gap.

Why this matters for finance AI

The RAT pattern is directly applicable to Beancount write-back agents. A Beancount agent needs to invoke CLI commands (bean-query, bean-report), Python APIs (beancount.loader, beancount.core), and the beancount-ledger FastAPI service — each with specific argument semantics that are documented but not necessarily in the model's training data. The Gorilla approach says: retrieve the relevant documentation snippet at inference time, inject it into the context, and train the model to read and follow it.
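A minimal version of that retrieve-and-inject loop might look like the following. Everything here is illustrative: the doc snippets are stand-ins, and the keyword-overlap scorer is a toy proxy for the BM25/GPT-Index retrievers the paper evaluates:

```python
def retrieve_doc(query: str, docs: dict[str, str], k: int = 1) -> list[str]:
    """Naive keyword-overlap retrieval over tool documentation snippets.
    A production agent would use BM25 or an embedding index instead."""
    q_terms = set(query.lower().split())
    scored = sorted(
        docs.items(),
        key=lambda item: len(q_terms & set(item[1].lower().split())),
        reverse=True,
    )
    return [doc for _, doc in scored[:k]]

# Illustrative snippets, not real Beancount documentation text.
TOOL_DOCS = {
    "bean-query": "bean-query FILE QUERY: run an SQL-like query over a ledger",
    "bean-report": "bean-report FILE REPORT: render a named report from a ledger",
}

context = retrieve_doc("run a query over my ledger file", TOOL_DOCS)
prompt = f"{context[0]}\n\nUser request: run a query over my ledger file"
```

The injected snippet plays the role of the `<retrieved_API_doc_JSON>` block in Gorilla's training format; the RAT insight is that the model must be fine-tuned to actually defer to it.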

The hallucination numbers are the most useful signal for a finance context. A 10% hallucination rate on ML model names is annoying. A 10% hallucination rate on ledger mutation calls — wrong account names, wrong currency codes, inverted debit/credit signs — is a correctness problem. The implication is that even a Gorilla-style trained agent needs an execution-time validator before any write is committed, consistent with what CRITIC (LOG-012) showed about tool-interactive critiquing. The retrieval degradation finding reinforces this: if real-world retrieval cuts accuracy by half, the safety net cannot be the retrieval quality alone.

The AST evaluation methodology translates naturally. Beancount transactions have a parseable structure, and checking generated directives against a schema using AST matching is exactly the kind of lightweight validator that could run in a pre-commit hook or agent loop.
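A sketch of such a validator, assuming nothing but a generated transaction string. This deliberately avoids the real Beancount parser to stay self-contained; a production check would run the directive through `beancount.loader.load_string` and inspect the returned errors instead of this hand-rolled regex:

```python
import re
from decimal import Decimal

# Matches an indented posting line: account, signed amount, currency code.
POSTING_RE = re.compile(
    r"^\s+(?P<account>[A-Z][A-Za-z0-9:-]+)\s+"
    r"(?P<amount>-?\d+(?:\.\d+)?)\s+(?P<currency>[A-Z]{3,5})\s*$"
)

def postings_balance(directive: str) -> bool:
    """Pre-commit sanity check: the posting amounts in a generated
    transaction must sum to zero per currency."""
    totals: dict[str, Decimal] = {}
    for line in directive.splitlines():
        m = POSTING_RE.match(line)
        if m:
            cur = m.group("currency")
            totals[cur] = totals.get(cur, Decimal("0")) + Decimal(m.group("amount"))
    return bool(totals) and all(t == 0 for t in totals.values())

txn = """2024-01-05 * "Coffee"
  Expenses:Food:Coffee   4.50 USD
  Assets:Cash           -4.50 USD"""
print(postings_balance(txn))  # True
```

A check like this catches exactly the failure class the hallucination numbers warn about: an agent that inverts a sign or drops a posting produces a directive that fails validation before it ever touches the ledger.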

Further reading

  • ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs (arXiv:2307.16789) — extends the API-calling problem to 16,000 real REST APIs with multi-step tool-use chains; directly addresses Gorilla's limitation of evaluating only single-call ML registry invocations
  • The Berkeley Function Calling Leaderboard (BFCL) (OpenReview:2GmDdhBdDk, NeurIPS 2024 poster) — the direct evolution of Gorilla into a living leaderboard tracking how frontier models improve at function calling over time; V3 adds multi-turn interactions, V4 adds agentic web search
  • API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs — evaluates LLMs on 73 APIs across a broader range of domains including finance and web services, with multi-turn tool use; a useful complement to APIBench's narrower ML focus