Systematic Evaluation of Uncertainty Estimation Methods in Large Language Models
Hobelsberger, Christian, Winner, Theresa, Nawroth, Andreas, Mitevski, Oliver, Haensch, Anna-Carolina
Large language models (LLMs) produce outputs with varying levels of uncertainty and, just as often, varying levels of correctness, making their practical reliability far from guaranteed. To quantify this uncertainty, we systematically evaluate four approaches for confidence estimation in LLM outputs: VCE, MSP, Sample Consistency, and CoCoA (Vashurin et al., 2025). To evaluate these approaches, we conduct experiments on four question-answering tasks using a state-of-the-art open-source LLM. Our results show that each uncertainty metric captures a different facet of model confidence and that the hybrid CoCoA approach yields the best reliability overall, improving both calibration and discrimination of correct answers. We discuss the trade-offs of each method and provide recommendations for selecting uncertainty measures in LLM applications.
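As an illustration of two of the evaluated measure families, the sketch below shows one common way to score MSP-style confidence (mean token probability over the generated answer) and sample consistency (majority-vote agreement across repeated samples). The function names, the normalization of answers, and the aggregation choices are our own assumptions for illustration, not the paper's implementation.

```python
from collections import Counter
import math

def msp_confidence(token_logprobs):
    """MSP-style score: mean per-token probability of the generated
    answer. Note: this is one common aggregation; the paper may
    aggregate token probabilities differently."""
    probs = [math.exp(lp) for lp in token_logprobs]
    return sum(probs) / len(probs)

def sample_consistency(samples):
    """Consistency score: fraction of sampled answers that agree with
    the majority answer, after trivial normalization (assumed here)."""
    counts = Counter(s.strip().lower() for s in samples)
    _, majority_freq = counts.most_common(1)[0]
    return majority_freq / len(samples)

# Illustrative usage with made-up numbers:
print(msp_confidence([-0.05, -0.20, -0.10]))           # ~0.89
print(sample_consistency(["Paris", "Paris", "Lyon"]))  # ~0.67
```

A hybrid measure in the spirit of CoCoA would combine signals like these (e.g., a model-probability score with a consistency score); the exact combination rule is defined in Vashurin et al. (2025) and is not reproduced here.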
arXiv.org Artificial Intelligence
Oct-24-2025