Beyond Overconfidence: Foundation Models Redefine Calibration in Deep Neural Networks
Achim Hekler, Lukas Kuhn, Florian Buettner
arXiv.org Artificial Intelligence
Reliable uncertainty calibration of neural networks is crucial for safety-critical applications. Current calibration research has two major limitations: evaluation restricted to large web-scraped datasets, and a lack of investigation of contemporary high-performance models that incorporate recent architectural and training innovations. To address this gap, we conducted a systematic investigation of different model generations on diverse datasets, revealing insights that challenge established calibration paradigms. Our results show that current-generation models consistently exhibit underconfidence in their in-distribution predictions, in contrast to the overconfidence typically reported for earlier model generations, while showing improved calibration under distribution shift. Although post-hoc calibration techniques significantly improve in-distribution calibration, their effectiveness progressively diminishes with increasing distribution shift, ultimately becoming counterproductive in extreme cases. Critically, extending our analysis to four diverse biomedical imaging datasets using transfer learning highlights the limited transferability of insights from web-scraped benchmarks: in these domains, convolutional architectures consistently achieve better calibration than transformer-based counterparts, irrespective of model generation. Our findings underscore that model advancements have complex effects on calibration, challenging simple narratives of monotonic improvement, and emphasize the need for domain-specific architectural evaluation beyond standard benchmarks.

The requirement for both accurate predictions and faithful confidence estimates is particularly critical in high-stakes domains such as medical diagnosis [1, 2], autonomous driving [3], and financial decision-making [4], where misaligned confidence estimates can lead to incorrect decisions with potentially severe or life-threatening consequences.
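The post-hoc calibration techniques mentioned above are commonly instantiated by temperature scaling, which rescales a model's logits by a single scalar fitted on held-out data. A minimal NumPy sketch, not the paper's implementation (all function names here are our own, and the grid search stands in for the usual gradient-based fit):

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the class axis.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(logits, labels, T):
    # Negative log-likelihood of the true labels at temperature T.
    p = softmax(logits / T)
    return -np.log(p[np.arange(len(labels)), labels] + 1e-12).mean()

def fit_temperature(logits, labels, grid=np.linspace(0.5, 5.0, 91)):
    # Pick the temperature minimising validation NLL via a simple grid search.
    return min(grid, key=lambda T: nll(logits, labels, T))
```

Because the grid contains T = 1.0 (the uncalibrated model), the fitted temperature can never make held-out NLL worse than no calibration; the paper's point is that this guarantee holds only on the distribution the temperature was fitted on.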
Model calibration offers a systematic framework for evaluating the reliability of a model's predictive confidence [5, 6]. In a well-calibrated model, confidence scores align closely with the true likelihood of correctness.
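This alignment is typically quantified with the Expected Calibration Error (ECE): predictions are grouped into confidence bins, and the weighted gap between mean confidence and accuracy is averaged over bins. A minimal sketch (our own illustrative code, not the paper's evaluation pipeline):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    # Binned ECE: sum over bins of (bin weight) * |mean confidence - accuracy|.
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)  # bins are (lo, hi]
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap
    return ece

# Perfectly calibrated toy example: confidence 0.8, and 8 of 10 correct.
conf = np.full(10, 0.8)
corr = np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0])
print(round(expected_calibration_error(conf, corr), 4))  # → 0.0
```

Underconfidence, as reported for current-generation models, shows up as bins where accuracy exceeds mean confidence; ECE itself is sign-blind, so reliability diagrams are needed to distinguish it from overconfidence.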
Jun-12-2025