A Appendix

Neural Information Processing Systems 

We first give a derivation of the equivalence between label smoothing regularization and Eq. 7. Evidently, the objective does not regularize confidence diversity.

"Scale both" corresponds to the originally proposed distillation objective in which both teacher and ...

Plots of test accuracy and ECE against the amount of temperature scaling applied are shown in Figure 1. First, we observe that models trained with student scaling have ECE almost identical to that of their teacher models. In direct contrast, student models trained without student scaling generally achieve much lower calibration error than their teachers. This coupled effect could be the reason for the observed conflict between ECE and accuracy.
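To make the two quantities discussed above concrete, the following is a minimal sketch of temperature-scaled softmax and of ECE. It assumes the common equal-width binning formulation of ECE with 10 bins. The text here does not state the binning scheme or bin count used for Figure 1, and the function names are illustrative, not taken from the paper.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (hypothetical helper)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins (assumed formulation).

    confidences: predicted-class probabilities in [0, 1]
    correct:     booleans, True where the prediction was right
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    # Per-bin accumulators: sample count, summed confidence, summed hits.
    counts = [0] * n_bins
    conf_sums = [0.0] * n_bins
    hit_sums = [0.0] * n_bins
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        b = min(int(conf * n_bins), n_bins - 1)
        counts[b] += 1
        conf_sums[b] += conf
        hit_sums[b] += 1.0 if hit else 0.0

    # Weighted average of |accuracy - mean confidence| over non-empty bins.
    ece = 0.0
    for b in range(n_bins):
        if counts[b] == 0:
            continue
        avg_conf = conf_sums[b] / counts[b]
        accuracy = hit_sums[b] / counts[b]
        ece += (counts[b] / n) * abs(accuracy - avg_conf)
    return ece
```

In this sketch, raising the temperature flattens the softmax output and so lowers the maximum predicted probability. That maximum is the confidence ECE compares against accuracy.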
