Asymmetric Temperature Scaling Makes Larger Networks Teach Well Again

Dec-23-2025, 20:22:03 GMT–Neural Information Processing Systems

Knowledge distillation (KD) aims to transfer the knowledge of a well-performing neural network (the teacher) to a weaker one (the student). A peculiar phenomenon is that a more accurate model does not necessarily teach better, and temperature adjustment cannot alleviate the capacity mismatch either.
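
For readers unfamiliar with the temperature adjustment mentioned above, the following is a minimal sketch of conventional KD with a single, uniform temperature. It is not the asymmetric scheme named in the title, whose details are not given in this abstract. The function names `softmax_with_temperature` and `kd_loss` and the example logits are illustrative assumptions.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits divided by one shared temperature (assumed > 0)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) between temperature-softened distributions."""
    p = softmax_with_temperature(teacher_logits, temperature)
    q = softmax_with_temperature(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


if __name__ == "__main__":
    # Hypothetical logits: a confident teacher and a less confident student.
    teacher_logits = [8.0, 2.0, 1.0]
    student_logits = [3.0, 2.0, 1.0]
    for t in (1.0, 4.0):
        soft_targets = softmax_with_temperature(teacher_logits, t)
        loss = kd_loss(teacher_logits, student_logits, t)
        print(f"T={t}: teacher={soft_targets}, KD loss={loss}")
```

A higher temperature flattens the teacher's distribution, so the wrong-class probabilities carry more weight in the loss. This is the adjustment that, according to the abstract, does not by itself fix the capacity mismatch.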


Country:
- Asia > China > Ningxia Hui Autonomous Region > Yinchuan (0.07)

Technology:
- Information Technology > Artificial Intelligence > Machine Learning (0.39)