Reassessing How to Compare and Improve the Calibration of Machine Learning Models
Standard machine learning models are trained to predict probability distributions over a set of possible actions or outcomes. Model-based decision-making then typically selects the action or outcome with the highest predicted probability, and ideally one would like to interpret that probability as the model's confidence in the predicted action or outcome. For this confidence interpretation to be valid, the predicted probabilities must be calibrated (Lichtenstein et al., 1982; Dawid, 1982; DeGroot & Fienberg, 1983); that is, they must accurately reflect the true frequencies of the outcome conditional on the prediction. As a classic informal example, a calibrated weather prediction model would satisfy the property that we observe rain on 80% of the days for which the model predicted a 0.8 probability of rain. As machine learning models, particularly deep learning models, are increasingly applied in high-stakes areas such as medical image diagnosis (Mehrtash et al., 2019; Elmarakeby et al., 2021; Nogales et al., 2021) and self-driving cars (Hu et al., 2023), calibrated model probabilities become correspondingly more important. Unfortunately, the seminal empirical investigation of Guo et al. (2017) demonstrated that deep learning models can be poorly calibrated, largely due to overconfidence. This observation has led to a number of follow-up works that aim to improve model calibration using both training-time methods (Thulasidasan et al., 2019; Müller et al., 2020; Wang et al., 2021) and post-training methods (Joy et al., 2022; Gupta & Ramdas, 2022). Comparing these proposed improvements, however, is non-trivial, because the measurement of calibration in practice is itself an active area of research (Nixon et al.,
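The weather example can be made concrete with a binned estimate of expected calibration error (ECE), the kind of metric used by Guo et al. (2017). Predictions are grouped by confidence, and the gap between each group's accuracy and its mean confidence is averaged, weighted by group size. The Python sketch below is an illustration only, not the measurement procedure this paper proposes or recommends. The function name `binned_ece`, the default of 10 equal-width bins, and the input format (the probability of the top predicted outcome plus whether that outcome occurred) are all assumptions made for this example.

```python
from typing import Sequence


def binned_ece(confidences: Sequence[float],
               correct: Sequence[bool],
               n_bins: int = 10) -> float:
    """Hypothetical helper: binned expected calibration error (ECE).

    confidences[i] is the probability the model assigned to its predicted
    (highest-probability) outcome on example i; correct[i] is True if that
    predicted outcome actually occurred.
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need equal-length, non-empty inputs")
    if n_bins < 1:
        raise ValueError("n_bins must be positive")

    counts = [0] * n_bins       # number of predictions per bin
    conf_sums = [0.0] * n_bins  # summed confidence per bin
    hits = [0] * n_bins         # number of correct predictions per bin

    for p, ok in zip(confidences, correct):
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"confidence {p} is outside [0, 1]")
        # Assumed equal-width bins [k/n_bins, (k+1)/n_bins); p == 1.0 joins the last bin.
        b = min(int(p * n_bins), n_bins - 1)
        counts[b] += 1
        conf_sums[b] += p
        hits[b] += int(ok)

    n = len(confidences)
    ece = 0.0
    for k in range(n_bins):
        if counts[k] == 0:
            continue  # empty bins contribute nothing
        accuracy = hits[k] / counts[k]
        avg_conf = conf_sums[k] / counts[k]
        # Weight each bin's |accuracy - confidence| gap by its share of the data.
        ece += (counts[k] / n) * abs(accuracy - avg_conf)
    return ece


# Weather example: 10 days forecast at 0.8 probability of rain, rain on 8 of them.
print(binned_ece([0.8] * 10, [True] * 8 + [False] * 2))  # approximately 0.0

# Overconfident model: 90% confidence, but only 6 of 10 predictions are right.
print(binned_ece([0.9] * 10, [True] * 6 + [False] * 4))  # approximately 0.3
```

The calibrated forecaster scores near zero, while the overconfident model's 0.3 gap mirrors the overconfidence that Guo et al. (2017) report. Estimates of this kind also depend on choices such as the number and placement of bins, which is one illustration of why measuring calibration in practice is non-trivial.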
Jun-6-2024