Uncertainty-Aware Deferral for LLM Agents: When to Escalate from Small to Large Models
The pressure on autonomous agents to be both cheap and reliable pulls in opposite directions: frontier models are reliable but expensive, small models are cheap but error-prone. Piatrashyn et al.'s ReDAct paper (arXiv:2604.07036) proposes a middle path — run a small model by default and defer to a large model only when the small model is uncertain. I'm reading it because the same tension defines every production Beancount write-back agent: you want the system to handle routine categorization cheaply and to escalate non-obvious cases before they corrupt the ledger.
The paper
ReDAct (Reason-Defer-Act) builds on the ReAct prompting paradigm and introduces a two-model agent architecture. A small cheap model — Qwen3-80B, Llama3.3-70B, or Llama4-Maverick — handles every step by default. At each step it generates a reasoning trace, then generates an action. The system measures token-level uncertainty over the action generation step only and compares it against a calibrated threshold. If uncertainty exceeds that threshold, the step is re-run by a large expensive model (GPT-5.2, Qwen3-235B, or Qwen3-480B); otherwise the small model's action is executed.
The uncertainty measures are information-theoretic and require only token-level log-probabilities: Sequence Probability (the summed negative token log-probabilities, i.e. the sequence negative log-likelihood), Perplexity (the same quantity length-normalised and exponentiated), and Mean Token Entropy (the entropy of the next-token distribution averaged across positions). The threshold is calibrated from a held-out set of small-model rollouts by choosing the value that produces a target number of large-model calls per episode, K.
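All three measures reduce to a few lines over whatever per-token log-probabilities the serving API returns. A minimal sketch, with the caveat that the function names and the top-k entropy estimate are mine, not the paper's:

```python
import math

def action_uncertainty(token_logprobs, token_distributions):
    """Token-level uncertainty scores over a generated action string.

    token_logprobs: log-prob of each sampled token (e.g. an API's
        `logprobs` field).
    token_distributions: per-position lists of (token, logprob) pairs for
        the top-k alternatives, used for the entropy estimate.
    """
    n = len(token_logprobs)
    nll = -sum(token_logprobs)           # sequence negative log-likelihood
    ppl = math.exp(nll / n)              # length-normalised perplexity
    # Mean token entropy, estimated over the truncated top-k alternatives.
    entropies = []
    for dist in token_distributions:
        probs = [math.exp(lp) for _, lp in dist]
        z = sum(probs)                   # renormalise the top-k mass
        entropies.append(-sum(p / z * math.log(p / z) for p in probs))
    return {"nll": nll, "ppl": ppl, "mean_entropy": sum(entropies) / n}

def should_defer(scores, threshold, measure="ppl"):
    """Escalate the step to the large model when uncertainty exceeds the threshold."""
    return scores[measure] > threshold
```

Note that only the action tokens are scored; the reasoning trace is generated first but excluded, per the paper's finding that action-level uncertainty is the discriminative signal.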
Key ideas
- Measure uncertainty at the action step, not the reasoning step. An auxiliary experiment on 2,411 ALFWorld steps found that reasoning-level uncertainty has poor discriminative power between correct and incorrect steps; action-level perplexity has measurably higher ROC-AUC and PRR as a predictor of correctness.
- PPL deferral with Qwen3-80B + GPT-5.2 achieves 80.8% ± 1.1% on ALFWorld, exceeding GPT-5.2 alone at 78.3% ± 1.9%, while costing $16.25 vs $45.21 — approximately 64% cheaper.
- In practice ~15% of steps are deferred against a calibration target of roughly 10%; the gap arises because failed (and therefore shorter) trajectories consume a disproportionate share of the deferral budget.
- Random deferral at the same rate scores 77.0% — still better than small-model-only (68.3%), but worse than UQ-guided deferral. The uncertainty signal itself matters; the gain is not just from calling the large model more often.
- MiniGrid shows less headroom. Qwen3-80B + GPT-5.2 with PPL deferral reaches 95.0% vs 99.0% for GPT-5.2 alone. The smaller task vocabulary creates a harder ceiling for the deferral approach when the small model is structurally inadequate.
- Deferral distribution is task-dependent. ALFWorld defers more in later steps (longer prompt history), while MiniGrid shows a bimodal pattern tied to initial agent position. This means fixed threshold calibration generalises better within a task family than across task families.
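The calibration step behind these numbers is simple enough to sketch. Assuming perplexity as the measure and small-model-only rollouts as the calibration set, picking the threshold amounts to a quantile over calibration scores; the helper name and the tie-handling below are my own guesses, not the paper's procedure verbatim:

```python
def calibrate_threshold(episode_scores, k_target):
    """Pick an uncertainty threshold so that, on the calibration rollouts,
    the expected number of deferrals per episode is roughly k_target.

    episode_scores: one list of action perplexities per calibration episode,
        collected from small-model-only rollouts as in the paper's setup.
    """
    all_scores = sorted(s for ep in episode_scores for s in ep)
    n = len(all_scores)
    steps_per_episode = n / len(episode_scores)
    defer_rate = min(1.0, k_target / steps_per_episode)
    n_defer = max(1, round(defer_rate * n))
    if n_defer >= n:
        return float("-inf")             # degenerate case: defer everything
    # Threshold just below the n_defer highest calibration scores, so that
    # exactly that many would have exceeded it and been deferred.
    return all_scores[n - n_defer - 1]
```

This sketch quietly assumes episodes have similar lengths; as the ~15%-vs-10% gap above shows, shorter failed trajectories break that assumption, which is exactly why the realised deferral rate overshoots the target.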
What holds up — and what doesn't
The core empirical finding is credible: perplexity over the action string is a reasonable proxy for whether a given step is about to go wrong. The reasoning/acting decomposition in ReAct naturally provides a clean point to attach an uncertainty signal, and the auxiliary correctness-prediction experiment gives genuine mechanistic justification for the design choice.
What I'm less convinced by: the "exceeds large-model-alone" result on ALFWorld. 80.8% ± 1.1% vs 78.3% ± 1.9% overlap at one standard deviation. The authors attribute this to complementary strengths — the small model handles routine steps without the large model's occasional risk-taking — but there is no per-step ablation to verify this narrative. It could just as easily be noise.
The benchmark choice is also limiting. ALFWorld and MiniGrid are text-based household simulation and grid-world navigation — narrow environments that do not exercise tool calling, code execution, or multi-document retrieval. Whether uncertainty-calibrated deferral holds in those richer settings (the settings relevant to Beancount) is unanswered. And the choice of GPT-5.2 as the large model makes the cost numbers hard to reproduce.
The calibration procedure has an unaddressed circularity: the threshold is tuned and evaluated against the same calibration distribution, with no independent validation split. The authors acknowledge distribution shift between calibration (small-model rollouts) and evaluation (hybrid rollouts), but leave threshold robustness to future work.
Why this matters for finance AI
Beancount write-back agents face exactly the same deferral question at every transaction. A routine grocery purchase needs categorisation; an unusual multi-leg foreign-currency swap with a partially matched memo needs a human. The current practice is either full automation (risky) or full human review (expensive). ReDAct's framework suggests a tractable middle ground: run the cheap model and escalate when perplexity over the candidate journal entry exceeds a calibrated threshold.
The finance context adds two considerations the paper doesn't address. First, deferral here should often mean pausing and asking the user, not calling a larger LLM — the ledger's correctness standard is the user's intent, not a benchmark score. Second, a committed Beancount entry is far harder to unwind than a misplaced object in ALFWorld. The calibration target K should therefore be tuned conservatively: demand higher small-model confidence before acting autonomously, and accept more deferrals rather than fewer.
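To make the transfer concrete, here is how the deferral decision might look inside a write-back agent. Everything in this sketch is hypothetical (the `Proposal` shape, the routing labels, the threshold values); the point is only that the escalation target changes with irreversibility:

```python
from dataclasses import dataclass

@dataclass
class Proposal:
    """A candidate Beancount posting from the small model (hypothetical shape)."""
    narration: str
    account: str          # e.g. "Expenses:Groceries"
    perplexity: float     # perplexity over the generated entry tokens

def route(proposal, threshold, irreversible=True):
    """Deferral policy for a write-back agent.

    Below the threshold, auto-commit. Above it, escalate — and for a
    ledger, escalation means a human review queue rather than a larger
    model, since a committed entry is hard to unwind.
    """
    if proposal.perplexity <= threshold:
        return "commit"
    return "ask_user" if irreversible else "retry_with_large_model"
```

Usage: `route(Proposal("TRADER JOES #552", "Expenses:Groceries", 1.3), threshold=2.0)` auto-commits the routine grocery line, while a high-perplexity multi-leg swap lands in the review queue.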
The 64% cost reduction signal is worth taking seriously even with those caveats. If a Beancount agent processes a month of transactions and only 15% of categorisation decisions need the expensive model, the economics of running a capable write-back agent look much better.
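The arithmetic behind that claim is worth writing down: the small model runs every step and the large model re-runs only the deferred fraction, so the hybrid wins whenever the deferral rate stays below 1 - c_small / c_large. A quick check against the paper's totals (per-step prices below are illustrative, not the paper's):

```python
def hybrid_cost_per_step(c_small, c_large, defer_rate):
    """Expected per-step cost: the small model runs every step; the
    large model re-runs only the deferred fraction of steps."""
    return c_small + defer_rate * c_large

# Sanity-check the headline numbers: $16.25 hybrid vs $45.21 large-only.
saving = 1 - 16.25 / 45.21
print(f"{saving:.0%}")    # prints "64%", the reported reduction

# Break-even: with illustrative prices c_small=1, c_large=10, the hybrid
# is cheaper for any defer_rate below 1 - 1/10 = 0.9 — far above the
# ~15% observed in the paper.
```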
What to read next
- KnowNo (Ren et al., 2023, CoRL): "Robots that ask for help: uncertainty alignment for large language model planners" — uses conformal prediction to calibrate a coverage guarantee on when to ask for help. ReDAct does not compare against it; understanding the trade-off between conformal guarantees and threshold calibration matters before choosing a production approach. [arXiv:2307.01928]
- A Survey of Confidence Estimation and Calibration in Large Language Models (Guo et al., NAACL 2024) — systematic taxonomy of verbalized confidence, sampling-based, and post-hoc calibration methods; the theoretical background for deciding whether perplexity is the right uncertainty proxy or whether calibrated logit scaling would perform better. [arXiv:2311.08298]
- UALA: Uncertainty-Aware Language Agent (Han, Buntine, Shareghi) — applies a structurally similar uncertainty threshold to the tool invocation decision (call a tool vs rely on model knowledge), reducing tool calls by over 50%; the direct complement to ReDAct for the tool-use axis of agent uncertainty. [https://uala-agent.github.io/]
