Beyond Overconfidence: Foundation Models Redefine Calibration in Deep Neural Networks
Achim Hekler, Lukas Kuhn, Florian Buettner
arXiv.org Artificial Intelligence
Reliable uncertainty calibration of neural networks is crucial for safety-critical applications. Current calibration research has two major limitations: evaluation restricted to large web-scraped datasets, and a lack of investigation of contemporary high-performance models that incorporate recent architectural and training innovations. To address this gap, we conducted a systematic investigation of different model generations on diverse datasets, revealing insights that challenge established calibration paradigms. Our results show that current-generation models consistently exhibit underconfidence in their in-distribution predictions, in contrast to the overconfidence typically reported for earlier model generations, while showing improved calibration under distribution shift. Although post-hoc calibration techniques significantly improve in-distribution calibration, their effectiveness progressively diminishes with increasing distribution shift, ultimately becoming counterproductive in extreme cases. Critically, extending our analysis to four diverse biomedical imaging datasets using transfer learning highlights the limited transferability of insights from web-scraped benchmarks: in these domains, convolutional architectures consistently achieve better calibration than transformer-based counterparts, irrespective of model generation. Our findings underscore that model advancements have complex effects on calibration, challenging simple narratives of monotonic improvement, and emphasize the need for domain-specific architectural evaluation beyond standard benchmarks.

The requirement for both accurate predictions and faithful confidence estimates is particularly critical in high-stakes domains such as medical diagnosis [1, 2], autonomous driving [3], and financial decision-making [4], where misaligned confidence estimates can lead to incorrect decisions with potentially severe or life-threatening consequences.
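The post-hoc calibration techniques mentioned above are commonly instantiated by temperature scaling, which rescales a model's logits by a single scalar fitted on held-out data. A minimal NumPy sketch, not the paper's implementation (all function names here are our own, and the grid search stands in for the usual gradient-based fit):

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the class axis.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(logits, labels, T):
    # Negative log-likelihood of the true labels at temperature T.
    p = softmax(logits / T)
    return -np.log(p[np.arange(len(labels)), labels] + 1e-12).mean()

def fit_temperature(logits, labels, grid=np.linspace(0.5, 5.0, 91)):
    # Pick the temperature minimising validation NLL via a simple grid search.
    return min(grid, key=lambda T: nll(logits, labels, T))
```

Because the grid contains T = 1.0 (the uncalibrated model), the fitted temperature can never make held-out NLL worse than no calibration; the paper's point is that this guarantee holds only on the distribution the temperature was fitted on.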
Model calibration offers a systematic framework for evaluating the reliability of a model's predictive confidence [5, 6]. In a well-calibrated model, confidence scores align closely with the true likelihood of correctness.
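This alignment is typically quantified with the Expected Calibration Error (ECE): predictions are grouped into confidence bins, and the weighted gap between mean confidence and accuracy is averaged over bins. A minimal sketch (our own illustrative code, not the paper's evaluation pipeline):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    # Binned ECE: sum over bins of (bin weight) * |mean confidence - accuracy|.
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)  # bins are (lo, hi]
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap
    return ece

# Perfectly calibrated toy example: confidence 0.8, and 8 of 10 correct.
conf = np.full(10, 0.8)
corr = np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0])
print(round(expected_calibration_error(conf, corr), 4))  # → 0.0
```

Underconfidence, as reported for current-generation models, shows up as bins where accuracy exceeds mean confidence; ECE itself is sign-blind, so reliability diagrams are needed to distinguish it from overconfidence.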
Jun-12-2025