Asymmetric Temperature Scaling Makes Larger Networks Teach Well Again
Neural Information Processing Systems
Knowledge Distillation (KD) aims at transferring the knowledge of a well-performing neural network (the {\it teacher}) to a weaker one (the {\it student}). A peculiar phenomenon is that a more accurate model does not necessarily teach better, and temperature adjustment cannot alleviate this capacity mismatch either.
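For context on what "temperature adjustment" means here, below is a minimal sketch of the standard symmetric-temperature KD loss (Hinton et al.), where a single temperature $T$ softens both the teacher and student logits. It is not the paper's Asymmetric Temperature Scaling; the function name, $T$, and $\alpha$ weighting are illustrative assumptions.

```python
# Sketch of the standard (symmetric-temperature) KD loss, for illustration only.
# The paper's Asymmetric Temperature Scaling (separate temperatures for teacher
# and student) is NOT reproduced here; names and defaults are assumptions.
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, targets, T=4.0, alpha=0.9):
    """Hard-label cross-entropy plus temperature-softened KL to the teacher."""
    soft_targets = F.softmax(teacher_logits / T, dim=1)      # teacher softened by T
    log_student = F.log_softmax(student_logits / T, dim=1)   # student softened by the same T
    # The KL term is scaled by T^2 so its gradient magnitude stays comparable
    # across different temperature settings.
    distill = F.kl_div(log_student, soft_targets, reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, targets)
    return alpha * distill + (1.0 - alpha) * hard
```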