How Flawed is ECE? An Analysis via Logit Smoothing

Muthu Chidambaram, Holden Lee, Colin McSwiggen, Semon Rezchikov

arXiv.org Artificial Intelligence 

The prevalence of machine learning across domains has increased drastically over the past few years, spurred by significant breakthroughs in deep learning for computer vision (Ramesh et al., 2022) and language modeling (Brown et al., 2020; OpenAI, 2023; Touvron et al., 2023). Consequently, the underlying deep learning models are increasingly being evaluated for critical use cases such as medical diagnosis (Elmarakeby et al., 2021; Nogales et al., 2021) and self-driving (Hu et al., 2023). In such settings, due to the risk associated with incorrect decision-making, it is crucial not only that the models be accurate, but also that they report well-founded predictive uncertainty. This desideratum is formalized via the notion of calibration (Dawid, 1982; DeGroot & Fienberg, 1983), which codifies how well the model-predicted probabilities of events reflect their true frequencies conditional on the predictions. For example, in a medical context, a model that yields the correct diagnosis for a patient 95% of the time when it assigns that diagnosis a probability of 0.95 can be considered calibrated. The analysis of whether modern deep learning models are calibrated can be traced back to the influential work of Guo et al. (2017), which showed that recent models exhibit calibration issues not present in earlier models; in particular, they tend to be overconfident when they are incorrect.
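To make this notion concrete, a standard formalization (the notation here is illustrative and not taken from the excerpt above) considers a probabilistic classifier $f$ with predicted confidence $\hat{p}(X) = \max_k f_k(X)$ and predicted label $\hat{Y}(X) = \arg\max_k f_k(X)$; the classifier is top-label calibrated when
\[
\mathbb{P}\big(Y = \hat{Y}(X) \,\big|\, \hat{p}(X) = p\big) = p \quad \text{for all } p \text{ attained by } \hat{p}(X),
\]
and the expected calibration error (ECE) referenced in the title measures the average deviation from this condition,
\[
\mathrm{ECE}(f) = \mathbb{E}\Big[\,\big|\,\mathbb{P}\big(Y = \hat{Y}(X) \mid \hat{p}(X)\big) - \hat{p}(X)\,\big|\,\Big],
\]
which in practice is typically estimated by binning the predicted confidences and comparing per-bin accuracy to per-bin average confidence.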