Calibrating Verbalized Confidence with Self-Generated Distractors

Wang, Victor, Stengel-Eskin, Elias

arXiv.org Artificial Intelligence 

Calibrated confidence estimates are necessary for large language model (LLM) outputs to be trusted by human users. While LLMs can express their confidence in human-interpretable ways, verbalized LLM-generated confidence scores have empirically been found to be miscalibrated, reporting high confidence on instances with low accuracy and thereby harming trust and safety. We hypothesize that this overconfidence often stems from a given LLM's heightened suggestibility when faced with claims about which it encodes little information; we empirically validate this hypothesis, finding more suggestibility on lower-accuracy claims. To further improve calibration, we leverage generator-validator disagreement, augmenting normalized validator confidence with a consistency-based estimate of generator confidence.

Users often rely on information obtained from LLMs to make important decisions, but the information is not always accurate. Thus, we seek to qualify LLM responses with confidence estimates that are calibrated, i.e., that match the probability of correctness. Users and agentic frameworks often use LLMs in a zero-shot manner without task-specific tuning (Manakul et al., 2023; Geng et al., 2024; Feng et al., 2024; Shorinwa et al., 2025), motivating the development of confidence estimation methods that work in off-the-shelf settings: both gray-box settings with logit access and black-box settings with only textual input and output. In these settings, verbalized confidence is a simple and commonly used approach that prompts the model to report its confidence in an answer (Lin et al., 2022; Xiong et al., 2024; Wei et al., 2024). For brevity, we use verbalized confidence as a blanket term for (1) asking the model to decode a numerical confidence like "80%" (Tian et al., 2023) and (2) asking the model whether an answer is correct and taking P(True) (Kadavath et al., 2022).
Verbalized confidence is appealing for several reasons, including that it resembles one way humans express confidence, making it easy to interpret and integrate into decision-theoretic frameworks (Sun et al., 2025; Steyvers et al., 2025). However, verbalized confidence has several drawbacks. First, it empirically tends to exhibit overconfidence (Tian et al., 2023; Xiong et al., 2024; Wei et al., 2024; Xu et al., 2025); Figure 1 (left) shows that verbalized confidence scores generally outstrip average accuracy within a confidence bin. (In Figure 1, each bar is labeled with the number of instances whose confidence falls in the interval, and larger bins are darkened.) In other words, no rejection threshold can be chosen to reject a high proportion of false claims.
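The comparison of confidence against accuracy within each confidence bin can be made concrete with a short sketch. The Python below is illustrative only and is not taken from the paper: the function names `reliability_bins` and `expected_calibration_error`, the choice of ten equal-width bins, and the toy data are all assumptions. Expected calibration error (ECE) is a standard summary of such a plot, but this section does not say which metric the authors report.

```python
from typing import List, Sequence, Tuple


def reliability_bins(
    confidences: Sequence[float],
    correct: Sequence[int],
    n_bins: int = 10,
) -> List[Tuple[float, float, int]]:
    """Group instances into equal-width confidence bins over [0, 1].

    Returns one (mean_confidence, accuracy, count) tuple per non-empty bin,
    ordered from the lowest bin to the highest.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    # Per bin: [sum of confidences, number correct, number of instances].
    totals = [[0.0, 0, 0] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        if not 0.0 <= conf <= 1.0:
            raise ValueError("confidence values must lie in [0, 1]")
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        totals[idx][0] += conf
        totals[idx][1] += int(is_correct)
        totals[idx][2] += 1
    return [(s / n, k / n, n) for s, k, n in totals if n > 0]


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[int],
    n_bins: int = 10,
) -> float:
    """Count-weighted average of |accuracy - mean confidence| over bins."""
    total = len(confidences)
    if total == 0:
        return 0.0
    bins = reliability_bins(confidences, correct, n_bins)
    return sum(n / total * abs(acc - conf) for conf, acc, n in bins)


# Toy data (made up for illustration): four answers reported at 95%
# confidence and two at 55%, with half of each group actually correct.
toy_confidences = [0.95, 0.95, 0.95, 0.95, 0.55, 0.55]
toy_correct = [1, 1, 0, 0, 1, 0]

for mean_conf, accuracy, count in reliability_bins(toy_confidences, toy_correct):
    print(f"mean confidence {mean_conf:.2f}  accuracy {accuracy:.2f}  n={count}")
print(f"ECE = {expected_calibration_error(toy_confidences, toy_correct):.3f}")
```

In this toy data the high-confidence bin reports about 0.95 while only half of its answers are correct, which is the overconfident pattern described above: mean confidence sits well above accuracy in the bin. A perfectly calibrated estimator would give bins whose accuracy equals their mean confidence, and an ECE of zero.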