Reassessing How to Compare and Improve the Calibration of Machine Learning Models