Asymmetric Temperature Scaling Makes Larger Networks Teach Well Again

Wen-Shu Fan

Neural Information Processing Systems 

Knowledge Distillation (KD) aims at transferring the knowledge of a well-performed neural network (the teacher) to a weaker one (the student). A peculiar phenomenon is that a more accurate model does not necessarily teach better, and temperature adjustment cannot alleviate the capacity mismatch either. To explain this, we decompose the efficacy of KD into three parts: correct guidance, smooth regularization, and class discriminability. The last term describes the distinctness of the wrong-class probabilities that the teacher provides in KD. Complex teachers tend to be over-confident, and traditional temperature scaling limits the efficacy of class discriminability, resulting in less discriminative wrong-class probabilities. Therefore, we propose Asymmetric Temperature Scaling (ATS), which separately applies a higher temperature to the correct class and a lower temperature to the wrong classes. ATS enlarges the variance of the wrong-class probabilities in the teacher's label and lets the student grasp the absolute affinities of the wrong classes to the target class as discriminatively as possible. Both theoretical analysis and extensive experimental results demonstrate the effectiveness of ATS.
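As a rough illustration of the idea described above, the following PyTorch sketch applies one temperature to the target-class logit and a lower one to the wrong-class logits before the softmax. The function name `ats_soft_labels` and the temperature values are illustrative assumptions, not the paper's reference implementation.

```python
import torch
import torch.nn.functional as F

def ats_soft_labels(teacher_logits, targets, tau_correct=4.0, tau_wrong=2.0):
    """Sketch of asymmetric temperature scaling on teacher logits.

    The target-class logit is divided by a higher temperature (tau_correct)
    and the wrong-class logits by a lower one (tau_wrong), which enlarges
    the variance of the wrong-class probabilities in the teacher's soft label.
    """
    # Boolean mask marking the correct class for each sample.
    is_target = F.one_hot(targets, num_classes=teacher_logits.size(1)).bool()
    # Apply the two temperatures element-wise, then normalize with softmax.
    scaled = torch.where(is_target,
                         teacher_logits / tau_correct,
                         teacher_logits / tau_wrong)
    return F.softmax(scaled, dim=1)
```

In a KD training loop, these soft labels would replace the usual single-temperature teacher distribution when computing the distillation loss against the student's outputs.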