LLM Confidence and Calibration: A Survey of What the Research Actually Shows
Last week I covered ReDAct, which routes agent decisions to an expensive fallback model when a cheap model's uncertainty exceeds a calibrated threshold. That paper does a lot of hand-waving about "uncertainty" — it's worth pausing to understand what the field actually knows about measuring and calibrating it. Geng et al.'s "A Survey of Confidence Estimation and Calibration in Large Language Models" (NAACL 2024) is the right place to start: a systematic taxonomy of what works, what doesn't, and what nobody has measured yet.
The paper
Geng, Cai, Wang, Koeppl, Nakov, and Gurevych survey the emerging literature on LLM confidence estimation and calibration across tasks ranging from multiple-choice QA to open-ended generation and machine translation. The core problem: an LLM can be highly accurate on average and still give no reliable outward signal of which individual answers are wrong. The survey organizes the solution space into two main branches: white-box methods that exploit access to internal model states, and black-box methods that treat the model as opaque. Within each branch it further distinguishes between estimating confidence and calibrating it post hoc.
The paper was published at NAACL 2024 (pages 6577–6595). It was revised in March 2024 from a November 2023 submission by a team spanning TU Darmstadt and MBZUAI (Mohamed bin Zayed University of AI).
Key ideas
- White-box confidence via logits: The simplest approach uses token-level probabilities or length-normalized log-likelihood as a confidence signal. These methods work but face a fundamental ambiguity: low token probability can reflect low factual confidence or simply unusual phrasing. The model may be uncertain about word choice while being certain about the underlying fact.
- Consistency-based black-box confidence (SelfCheckGPT): Manakul et al. (EMNLP 2023) sample multiple completions and score their mutual consistency using BERTScore, NLI, or n-gram overlap. No logit access needed. The key insight: for facts the LLM knows well, repeated samples converge; for hallucinated facts, they diverge.
- Semantic entropy: Farquhar et al. (Nature, 2024) cluster semantically equivalent answers before computing entropy. An LLM might answer "Paris" in one sample and "the French capital" in another; raw token entropy treats these as divergent, while semantic entropy does not. This is a qualitative step forward over token-level consistency, and the survey puts it in context.
- Verbalized confidence is broken: When asked to output a confidence percentage, models collapse into overconfidence. Empirical work (Groot et al., TrustNLP at ACL 2024) finds that GPT-3, GPT-3.5, and Vicuna all show average Expected Calibration Error (ECE) exceeding 0.377 for verbalized confidence, with predictions clustering in the 90–100% range regardless of actual accuracy. Even GPT-4, the best-calibrated model evaluated, achieves an AUROC of only ~62.7% when using verbalized confidence to discriminate correct from incorrect answers, not far above the 50% of random guessing.
- Calibration techniques vary by task: For classification, contextual calibration (subtracting class-prior bias estimated with an empty "[N/A]" prompt) and position debiasing (PriDe) address known systematic biases. For generation, Sequence Likelihood Calibration (SLiC) fine-tunes models on ranked completions. Temperature scaling, the simplest post-hoc fix, remains competitive in many settings.
- No unified benchmark exists: This is the survey's most damning structural observation. No single benchmark spans confidence estimation methods across tasks and domains, which makes it nearly impossible to compare methods rigorously. The field is comparing apples to oranges.
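To make the temperature scaling mentioned in the calibration bullet concrete, here is a minimal sketch of my own, not code from the survey. It assumes you have raw class logits and gold labels for a held-out set, and it fits a single temperature by grid search over negative log-likelihood. The helper names, the grid, and the toy logits are all invented for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (T > 1 flattens the distribution)."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - top) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def mean_nll(logit_rows, labels, temperature):
    """Average negative log-likelihood of the gold labels at a given temperature."""
    total = 0.0
    for logits, label in zip(logit_rows, labels):
        total -= math.log(softmax(logits, temperature)[label])
    return total / len(labels)

def fit_temperature(logit_rows, labels, grid=None):
    """Pick the temperature on a coarse grid that minimizes held-out NLL."""
    if grid is None:
        grid = [0.25 * k for k in range(1, 41)]  # 0.25, 0.50, ..., 10.0
    return min(grid, key=lambda t: mean_nll(logit_rows, labels, t))

# Invented held-out set: the model is ~96% confident every time but right 3 times out of 4.
val_logits = [[4.0, 0.0, 0.0], [4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 4.0]]
val_labels = [0, 1, 1, 2]

T = fit_temperature(val_logits, val_labels)
before = max(softmax(val_logits[0], 1.0))
after = max(softmax(val_logits[0], T))
print(f"fitted T={T}, top-class confidence {before:.3f} -> {after:.3f}")
```

Dividing by one scalar never changes which class ranks first, so accuracy is untouched; only the stated confidence moves. A real implementation would typically optimize the temperature with a gradient-based method rather than a grid.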
What holds up — and what doesn't
The taxonomy is solid. The white-box vs. black-box distinction is genuinely useful for system design, and the treatment of logit-based methods is honest about their limits — the authors note directly that token probability conflates factual confidence with lexical uncertainty. Practitioners underestimate this conflation.
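The conflation is easy to see in a few lines. The sketch below is my own illustration, assuming you can read per-token log-probabilities from the model; the function name and the probabilities are made up. It computes the raw sequence probability and the length-normalized version (the geometric mean of token probabilities) for two phrasings that carry the same fact.

```python
import math

def sequence_confidence(token_logprobs):
    """Return (raw sequence probability, length-normalized probability)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    total = sum(token_logprobs)
    raw = math.exp(total)
    # exp(mean log-prob) is the geometric mean of the per-token probabilities.
    normalized = math.exp(total / len(token_logprobs))
    return raw, normalized

# Hypothetical log-probs. Both answers contain the same two "fact" tokens at p=0.9;
# the wordy answer adds four filler tokens where word choice is a coin flip (p=0.5).
terse = [math.log(0.9), math.log(0.9)]
wordy = [math.log(0.9)] + [math.log(0.5)] * 4 + [math.log(0.9)]

terse_raw, terse_norm = sequence_confidence(terse)
wordy_raw, wordy_norm = sequence_confidence(wordy)
print(f"terse: raw={terse_raw:.3f} normalized={terse_norm:.3f}")
print(f"wordy: raw={wordy_raw:.3f} normalized={wordy_norm:.3f}")
```

Length normalization softens the penalty for the longer answer but does not remove it: the wordy phrasing still scores lower even though the model is equally sure of the fact.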
Where the survey frustrates me: it is largely descriptive. There are almost no experiments comparing methods head-to-head, and the authors acknowledge this explicitly as a limitation. I came away with a clear design-space map but no guidance on which method to use for a new task.
The verbalized-confidence results (GPT-4's AUROC of ~62.7% on its own stated confidence) should be canonical knowledge for anyone deploying LLMs in production. They aren't. People still ship prompts that ask "on a scale of 1–10, how confident are you?" and treat the answer as meaningful, which it is not.
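If you want to audit your own prompts the same way, both metrics fit in a few lines. This is a sketch under my own assumptions: equal-width binning for ECE, a pairwise-comparison AUROC, and a tiny invented sample of stated confidences and correctness labels. None of the numbers come from the papers above.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins: weighted mean of |avg confidence - accuracy| per bin."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += len(members) / n * abs(avg_conf - accuracy)
    return ece

def auroc(confidences, correct):
    """Probability that a correct answer gets higher confidence than an incorrect one."""
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Invented example: stated confidence clusters in 90-100% while half the answers are wrong.
stated = [0.95, 0.90, 0.95, 0.90, 0.95, 0.90, 1.00, 0.90]
is_correct = [1, 0, 1, 0, 0, 1, 1, 0]

ece = expected_calibration_error(stated, is_correct)
auc = auroc(stated, is_correct)
print(f"ECE={ece:.3f}  AUROC={auc:.3f}")
```

ECE measures how far stated confidence sits from observed accuracy; AUROC measures whether higher confidence at least ranks correct answers above wrong ones. A model can fail one and pass the other, so it is worth checking both.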
The survey is also thin on the RLHF calibration question: does post-training with human feedback make models better or worse calibrated? There is evidence both ways, and the survey largely sidesteps it.
Why this matters for finance AI
ReDAct stakes its safety story on having a calibrated uncertainty signal from the cheap model. The survey makes clear how hard that actually is. Logit-based signals are available in white-box settings but conflate lexical and factual uncertainty. Consistency-based methods work in black-box settings but require multiple samples per decision — expensive for a high-throughput Beancount write-back agent processing a batch of transaction entries.
The most actionable finding for Bean Labs: semantic entropy clusters semantically equivalent answers before scoring consistency, which is precisely what matters for ledger entries where a model might express the same debit/credit relationship in multiple syntactically distinct forms. A Beancount agent should use semantic clustering over sampled ledger-entry completions — not raw token-level variance — to detect when it is hallucinating an account name or amount.
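Here is roughly what that could look like, as a sketch rather than a faithful reimplementation. Farquhar et al. decide equivalence with a learned entailment check; I swap in a deterministic canonicalization of postings, which is my own assumption about what "same meaning" means for a ledger entry. I also use sample counts in place of sequence probabilities. The posting tuple format, function names, and sample data are hypothetical.

```python
import math
from collections import Counter
from decimal import Decimal

def canonical_entry(postings):
    """Map an entry to a form that ignores posting order, amount formatting, and currency case.

    postings: iterable of (account, amount_string, currency) tuples (hypothetical format).
    """
    return tuple(sorted(
        (account.strip(), Decimal(amount), currency.strip().upper())
        for account, amount, currency in postings
    ))

def entropy_of_counts(counter):
    """Shannon entropy (in nats) of the empirical distribution given by a Counter."""
    n = sum(counter.values())
    return -sum((k / n) * math.log(k / n) for k in counter.values())

def naive_entropy(samples):
    """Entropy over surface forms: any formatting difference counts as disagreement."""
    return entropy_of_counts(Counter(tuple(s) for s in samples))

def semantic_entropy(samples):
    """Entropy over meaning clusters: samples with the same canonical form are merged."""
    return entropy_of_counts(Counter(canonical_entry(s) for s in samples))

# Five invented samples for one transaction. Four say the same thing in different
# surface forms; the last one picks a different expense account.
samples = [
    [("Expenses:Food:Groceries", "45.00", "USD"), ("Assets:Checking", "-45.00", "USD")],
    [("Assets:Checking", "-45", "USD"), ("Expenses:Food:Groceries", "45", "USD")],
    [("Expenses:Food:Groceries", "45.0", "usd"), ("Assets:Checking", "-45.0", "usd")],
    [("Expenses:Food:Groceries", "45.00", "USD"), ("Assets:Checking", "-45.00", "USD")],
    [("Expenses:Food:Restaurants", "45.00", "USD"), ("Assets:Checking", "-45.00", "USD")],
]

print(f"naive entropy:    {naive_entropy(samples):.3f}")
print(f"semantic entropy: {semantic_entropy(samples):.3f}")
```

The naive score is inflated by posting order and number formatting, while the clustered score isolates the one disagreement that matters, the account name. Where to set the threshold for escalating an entry is something you would have to calibrate on labeled data.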
The calibration failure of verbalized confidence is a direct warning for any UI that surfaces "how confident is the AI?" to the user: do not trust the number the model produces. Use an external calibrator or consistency-based method instead, or don't surface it at all.
What to read next
- Farquhar et al., "Detecting hallucinations in large language models using semantic entropy," Nature, 2024 — the most rigorous method that comes out of this survey framework; worth reading in full rather than through the survey's summary.
- Manakul et al., "SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models," EMNLP 2023 (arXiv:2303.08896) — the canonical consistency-based method; essential to understand before deploying any black-box confidence signal.
- Groot et al., "Overconfidence is Key: Verbalized Uncertainty Evaluation in Large Language and Vision-Language Models," TrustNLP at ACL 2024 (arXiv:2405.02917) — the most thorough empirical audit of how verbalized confidence breaks down across models and tasks.
