Asymmetric Temperature Scaling Makes Larger Networks Teach Well Again

Oct-10-2024, 00:46:42 GMT–Neural Information Processing Systems

Knowledge Distillation (KD) aims to transfer the knowledge of a well-performing neural network (the teacher) to a weaker one (the student). A peculiar phenomenon is that a more accurate model does not necessarily teach better, and adjusting the temperature does not resolve this capacity mismatch either. Class discriminability describes how distinct the wrong class probabilities are that the teacher provides in KD. Complex teachers tend to be over-confident, and traditional temperature scaling limits the efficacy of class discriminability, resulting in less discriminative wrong class probabilities. Therefore, we propose Asymmetric Temperature Scaling (ATS), which applies a higher temperature to the correct class and a lower temperature to the wrong classes.
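As a rough illustration of the idea in the abstract, the sketch below contrasts conventional temperature scaling with an asymmetric variant. It is not the authors' implementation: the function names, the example logits, the temperature values, and the choice to mark the correct class by an explicit index are all assumptions made for illustration, and `wrong_class_variance` is only a crude proxy for how distinct the wrong class probabilities are.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def uniform_temperature_probs(logits, tau):
    """Conventional scaling: every logit is divided by the same temperature."""
    return softmax([z / tau for z in logits])


def asymmetric_temperature_probs(logits, correct_idx, tau_correct, tau_wrong):
    """Hypothetical ATS sketch: the correct-class logit is divided by
    tau_correct, every other logit by tau_wrong (assumed tau_correct >= tau_wrong)."""
    scaled = [
        z / (tau_correct if i == correct_idx else tau_wrong)
        for i, z in enumerate(logits)
    ]
    return softmax(scaled)


def wrong_class_variance(probs, correct_idx):
    """Variance of the wrong-class probabilities after renormalising them
    to sum to 1. An illustrative proxy, not a definition from the paper."""
    wrong = [p for i, p in enumerate(probs) if i != correct_idx]
    total = sum(wrong)
    wrong = [p / total for p in wrong]
    mean = 1.0 / len(wrong)
    return sum((p - mean) ** 2 for p in wrong) / len(wrong)


# Made-up logits of an over-confident teacher; class 0 is the correct class.
teacher_logits = [10.0, 4.0, 2.0, 1.0]
uniform = uniform_temperature_probs(teacher_logits, tau=4.0)
asymmetric = asymmetric_temperature_probs(
    teacher_logits, correct_idx=0, tau_correct=4.0, tau_wrong=2.0
)
print("uniform:   ", [round(p, 3) for p in uniform])
print("asymmetric:", [round(p, 3) for p in asymmetric])
print("wrong-class variance (uniform):   ", wrong_class_variance(uniform, 0))
print("wrong-class variance (asymmetric):", wrong_class_variance(asymmetric, 0))
```

With equal temperatures the asymmetric function reduces to the conventional one, so the two temperatures can be tuned starting from a standard KD setup. The example uses positive logits; dividing a negative logit by a larger temperature would raise it rather than lower it, which a real implementation would need to account for.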


