Asymmetric Temperature Scaling Makes Larger Networks Teach Well Again

Neural Information Processing Systems 

Knowledge Distillation (KD) aims to transfer the knowledge of a well-performing neural network (the teacher) to a weaker one (the student).
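
For reference, the standard KD objective (Hinton et al., 2015) applies a single shared temperature to both the teacher's and the student's logits and mixes the resulting soft-label KL term with the usual cross-entropy on hard labels. The sketch below illustrates this baseline only, not the asymmetric temperature scaling proposed in this paper; the function name and the values of `T` and `alpha` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, targets, T=4.0, alpha=0.9):
    """Standard (symmetric-temperature) knowledge-distillation loss."""
    # Cross-entropy with the ground-truth (hard) labels
    ce = F.cross_entropy(student_logits, targets)

    # KL divergence between temperature-softened distributions;
    # the T**2 factor keeps gradient magnitudes comparable across temperatures
    soft_teacher = F.softmax(teacher_logits / T, dim=1)
    log_soft_student = F.log_softmax(student_logits / T, dim=1)
    kl = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (T ** 2)

    # Weighted combination of the distillation term and the hard-label term
    return alpha * kl + (1 - alpha) * ce
```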