
CRITIC: Why LLM Self-Correction Requires External Tool Feedback

· 6 min read
Mike Thrift
Marketing Manager

Reading CRITIC (Gou et al., ICLR 2024) while thinking about what happens after a finance agent makes a mistake. Reflexion told us agents can learn from failure over episodes. CRITIC asks a sharper question: can an LLM catch and fix its own errors within a single generation pass — and if so, what does it actually need to do that?

The paper


CRITIC introduces a framework in which a language model generates an initial output, then iterates through a verify-then-correct loop using external tools — a search API for factual claims, a Python interpreter for code and arithmetic, and a toxicity classifier for content moderation. The loop runs for a fixed number of iterations (the paper reports effective results in around three corrections), producing a refined output that the authors evaluate on free-form question answering (TriviaQA, AmbigNQ, HotpotQA), mathematical program synthesis, and toxicity reduction.
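The loop itself is simple enough to sketch. Here is a minimal, hypothetical rendering of the verify-then-correct cycle; `generate`, `verify_with_tool`, and `revise` are stand-ins for the paper's prompted LLM calls and tool invocations, and the three-iteration cap mirrors the setting the authors report.

```python
def critic_loop(question, generate, verify_with_tool, revise, max_iters=3):
    """Sketch of CRITIC's verify-then-correct cycle.

    `verify_with_tool` returns None when the external tool finds no
    problem, or a critique string (search evidence, interpreter error,
    classifier score) that is fed back into the revision prompt.
    """
    answer = generate(question)
    for _ in range(max_iters):
        critique = verify_with_tool(question, answer)
        if critique is None:  # tool raised no objection; stop early
            break
        answer = revise(question, answer, critique)
    return answer
```

The structural point the sketch makes explicit: the model never decides on its own that the answer is wrong. The tool's output is the only thing that triggers a revision.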

The central claim is not that LLMs can self-correct on their own. It is almost the opposite: the value of CRITIC comes precisely from grounding the critique in an external signal the model cannot fake. Without the search API, the QA improvements shrink to near zero or reverse. The framework works because the tool tells the model something it genuinely did not know, not because the model becomes a reliable self-auditor.

Key ideas

  • Applied to ChatGPT, CRITIC achieves an average 7.7-point F1 improvement across three open-domain QA tasks and a 7.0-percentage-point absolute gain across three mathematical reasoning benchmarks.
  • Toxicity reduction is the most striking single result: a 79.2% reduction in toxicity probability on the evaluated dataset.
  • Removing the search API causes QA performance to either plateau or degrade — the model's intrinsic self-critiquing ability is close to useless for factual tasks.
  • The loop converges quickly: three correction rounds capture most of the gain, with diminishing returns beyond that.
  • The framework is model-agnostic and requires no fine-tuning; it works on black-box APIs including both Text-Davinci-003 and ChatGPT.
  • CRITIC outperforms self-consistency (majority voting over multiple samples) on most tasks, which is significant because self-consistency has no per-step tool cost.

What holds up — and what doesn't

The core empirical result is solid: external tool feedback meaningfully improves outputs, and the ablation removing the search API is damning for naive self-correction advocates. The paper is also honest about the mechanism — the gains come from the tool, not from some emergent metacognitive capacity.

What I find underexplored is the failure mode taxonomy. When does the model generate a bad critique that leads it further from the correct answer? The paper reports average performance, but variance across tasks and question types would matter enormously for deployment. In a financial context, the worst outcome is not "no improvement" — it is a plausible-sounding correction that introduces a new error.

The choice to cap at three iterations is also presented as a practical convenience rather than a principled stopping criterion. Three rounds may work for TriviaQA where there is a ground-truth answer to converge toward. In a domain like ledger reconciliation, where the "correct" answer requires multi-document reasoning and domain knowledge, it is not obvious that three tool calls suffice — or that a general-purpose search API provides the right verification signal at all.

The companion ICLR 2024 paper "Large Language Models Cannot Self-Correct Reasoning Yet" (Huang et al., arXiv:2310.01798) confirms CRITIC's own finding from the other direction: without external feedback, self-correction reliably degrades reasoning accuracy. These two papers together form a coherent picture — the capacity people were calling "self-correction" is mostly external-feedback-driven refinement, and the distinction matters.

Why this matters for finance AI

The CRITIC loop maps naturally onto the write-back safety problem in Beancount agents. Right now, when an LLM agent proposes a journal entry — say, categorizing a transaction or splitting an expense — there is no principled way for it to verify its own output before committing to disk. CRITIC's architecture suggests a concrete pattern: generate a candidate entry, then run verification against a tool (a balance-check function, a rules engine, a duplicate detector) and use the tool's output to prompt a revision before the write lands.
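A minimal sketch of what one such verification tool could look like, assuming a toy posting format of `(account, amount, currency)` tuples rather than Beancount's actual data model. The key design point is the return type: `None` for "no objection", or a critique string the agent can act on in its revision pass.

```python
from decimal import Decimal

def check_balances(postings, tolerance=Decimal("0.005")):
    """Toy balance check for a candidate journal entry.

    Returns None if postings sum to zero per currency (within a
    tolerance), otherwise a critique string suitable for feeding
    back into the agent's revision prompt.
    """
    totals = {}
    for account, amount, currency in postings:
        totals[currency] = totals.get(currency, Decimal("0")) + amount
    residuals = {c: t for c, t in totals.items() if abs(t) > tolerance}
    if not residuals:
        return None
    return "Entry does not balance: " + ", ".join(
        f"{c} off by {t}" for c, t in residuals.items())
```

For example, a candidate entry posting `12.50 USD` to an expense account but only `-12.00 USD` to the funding account would come back with a critique naming the 0.50 USD residual, and the revision prompt would carry that exact signal.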

The toxicity result is an analogy I find useful to restate: a 79.2% reduction in policy violations does not come from the model internalizing the rules — it comes from a classifier that reports violations back to the model. For a Beancount ledger, the equivalent would be a rule-checker that flags double-counted transactions or category violations, and feeds that signal into the agent's revision pass. The agent does not need to independently know the rules are broken; it needs the tool's signal.
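The same pattern covers the duplicate-detector case. A hypothetical sketch, with transactions modeled as plain dicts rather than real Beancount directives: the detector compares a candidate against the existing ledger and, like the toxicity classifier, reports the violation back as text the agent never has to discover on its own.

```python
def duplicate_critique(candidate, ledger):
    """Flag a candidate transaction that matches an existing one on
    date, amount, and payee; return the match as a critique string,
    or None if no duplicate is found."""
    key = (candidate["date"], candidate["amount"], candidate["payee"])
    for txn in ledger:
        if (txn["date"], txn["amount"], txn["payee"]) == key:
            return (f"Possible duplicate of existing transaction on "
                    f"{txn['date']} ({txn['payee']}, {txn['amount']})")
    return None
```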

The critical limitation for finance is the search-API dependency. Finance agents need verification tools that are domain-specific: account-balance integrity checks, chart-of-accounts validators, tax-rule lookups. A generic web search is unlikely to catch a misclassified expense. Building the right tool layer for CRITIC-style correction in accounting is where the real engineering work is — and the paper does not address domain-specific tool design at all.

Related reading

  • "Large Language Models Cannot Self-Correct Reasoning Yet" (Huang et al., 2023, arXiv:2310.01798) — the direct empirical argument that intrinsic self-correction fails; should be read alongside CRITIC since they triangulate the same mechanism from opposite directions.
  • "Tree of Thoughts: Deliberate Problem Solving with Large Language Models" (Yao et al., NeurIPS 2023, arXiv:2305.10601) — extends the single-path critique-and-correct idea to a search tree over intermediate steps; relevant for multi-step reconciliation where the agent needs to explore and backtrack.
  • "ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs" (Qin et al., 2023, arXiv:2307.16789) — examines how agents learn to select and chain tool calls, which is the upstream problem CRITIC takes for granted.