Review for NeurIPS paper: Intra Order-preserving Functions for Calibration of Multi-Class Neural Networks

Neural Information Processing Systems 

Additional Feedback: The proposed methods perform very strongly in ECE, slightly better than the state of the art in NLL, and slightly worse in classwise-ECE. It would be good to have some explanation of why ECE and classwise-ECE give such different results. Since ECE measures the calibration of only the class with the highest predicted probability and ignores the other class probabilities, does this mean that the proposed method is better than the state of the art on the top-1 probability but slightly weaker on the other classes? In the appendix provided as supplemental material, lines 739-742 claim that ECE does not suffer from the problem highlighted for classwise-ECE at lines 731-738. While this is technically correct, it misses the point: ECE suffers from essentially the same problem.
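
To make the difference between the two metrics concrete, here is a minimal sketch of how top-label ECE and classwise-ECE can diverge on the same predictions. It is not the authors' evaluation code: the function names (`top_label_ece`, `classwise_ece`, `_binned_gap`), the equal-width binning with 15 bins, and the toy data are all assumptions made for illustration.

```python
import numpy as np


def _binned_gap(conf, hit, n_bins=15):
    """Sum over equal-width bins of (bin weight) * |mean(hit) - mean(conf)|."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges only, so bin indices run from 0 to n_bins - 1.
    idx = np.digitize(conf, edges[1:-1], right=True)
    gap = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            gap += mask.mean() * abs(hit[mask].mean() - conf[mask].mean())
    return gap


def top_label_ece(probs, labels, n_bins=15):
    """ECE on the highest predicted probability only (confidence vs. accuracy)."""
    conf = probs.max(axis=1)
    hit = (probs.argmax(axis=1) == labels).astype(float)
    return _binned_gap(conf, hit, n_bins)


def classwise_ece(probs, labels, n_bins=15):
    """Average over classes of the binned gap between p_k and the frequency of class k."""
    n_classes = probs.shape[1]
    gaps = [
        _binned_gap(probs[:, k], (labels == k).astype(float), n_bins)
        for k in range(n_classes)
    ]
    return float(np.mean(gaps))


# Toy data (hypothetical): every sample gets the same prediction.
probs = np.tile(np.array([0.6, 0.3, 0.1]), (10, 1))
# Class 0 occurs 6 times, class 1 once, class 2 three times.
labels = np.array([0] * 6 + [1] + [2] * 3)

ece = top_label_ece(probs, labels)
cw_ece = classwise_ece(probs, labels)
print(f"top-label ECE: {ece:.4f}, classwise-ECE: {cw_ece:.4f}")
```

In this toy case the top-1 confidence (0.6) matches the top-1 accuracy (6 of 10), so top-label ECE is essentially zero, while the probabilities assigned to classes 1 and 2 are swapped relative to their frequencies, giving a classwise-ECE of about 0.13. A method can therefore look well calibrated under ECE while being miscalibrated on the non-top classes, which is the kind of gap the question above asks the authors to explain.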