Measuring Language Model Hallucinations Through Distributional Correctness

Burns, Thomas F

arXiv.org Artificial Intelligence 

Common evaluation paradigms for language models focus on scoring single responses through accuracy metrics or proper scoring rules, failing to capture the full richness of a model's belief state. Recent work illustrates that language models hallucinate in part because they are optimised to be good test-takers under binary scoring schemes that reward any answer over abstention. While this insight naturally leads to penalty-based approaches, such approaches ignore crucial distinctions in how models distribute uncertainty, for example between hedging toward incorrect answers and hedging toward "I don't know" responses. A novel evaluation metric, the Distributional Correctness Score (DCS), is introduced to address this problem by considering a model's entire probability distribution over answer choices. DCS naturally distinguishes between harmful overconfidence in wrong answers and uncertainty expressed through abstention, providing scores in an interpretable default range. Through theoretical analysis and illustrative examples, DCS is demonstrated to offer a more nuanced and aligned evaluation paradigm that incentivises models to express genuine uncertainty rather than guessing. Adapting 12 existing evaluation benchmarks to DCS's variants and measuring performance on six language models reveals that, for half of the tested benchmarks, scores are negative across all tested models, indicating significant tendencies towards hallucination.
Evaluation of language models has commonly focused on whether they produce 'correct' or desired outputs in response to given inputs or instructions, as measured using accuracy or probability-based scoring rules that account for confidence in model predictions. However, the paradigm of focusing on a single answer fundamentally misses a critical aspect of evaluating performance: how models distribute their beliefs across the space of possible responses, including the possibility of abstaining from answering in conditions of uncertainty.
Recent work (Kalai et al., 2025) provides compelling evidence that language model 'hallucinations' persist in part due to the socio-technical problem of flawed evaluation metrics. Under traditional binary scoring, where correct answers receive a positive score (maximally 1 for perfect correctness), any response like "I don't know" (IDK) receives 0, and incorrect answers also receive 0, the optimal strategy for any rational agent is to always guess rather than abstain, even when confidence in the guess is minimal: a guess can only gain score, while abstention is guaranteed to score 0. This creates a systematic bias in our evaluation paradigms toward overconfident responses and offers a socio-technical explanation for why language models persist in making confident assertions about uncertain information, i.e., 'hallucinate'.
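The incentive created by binary scoring can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not taken from the paper: the function names `expected_binary_score` and `best_action` are hypothetical, and the scheme is assumed to award exactly 1 for a correct answer and 0 for both incorrect answers and IDK.

```python
def expected_binary_score(p_correct: float, abstain: bool) -> float:
    """Expected score under an assumed binary scheme.

    Correct answer -> 1, incorrect answer -> 0, IDK -> 0.
    `p_correct` is the probability that the guess would be correct.
    """
    if not 0.0 <= p_correct <= 1.0:
        raise ValueError("p_correct must lie in [0, 1]")
    if abstain:
        # Abstaining always scores 0, regardless of confidence.
        return 0.0
    # Guessing scores 1 with probability p_correct and 0 otherwise.
    return p_correct * 1.0 + (1.0 - p_correct) * 0.0


def best_action(p_correct: float) -> str:
    """Return the action with the higher expected score (ties go to guessing)."""
    guess = expected_binary_score(p_correct, abstain=False)
    idk = expected_binary_score(p_correct, abstain=True)
    return "guess" if guess >= idk else "abstain"


if __name__ == "__main__":
    for p in (0.01, 0.25, 0.9):
        print(p, expected_binary_score(p, abstain=False), best_action(p))
```

Under these assumptions, guessing is never worse than abstaining for any confidence level, which is the bias toward overconfident responses described above.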