Efficient Knowledge Distillation from Model Checkpoints

Oct-9-2024, 09:50:14 GMT–Neural Information Processing Systems

Knowledge distillation is an effective approach for learning compact models (students) under the supervision of large, strong models (teachers). Because the performance of teacher and student models is strongly correlated empirically, it is commonly believed that a high-performing teacher is preferable. Consequently, practitioners tend to use a well-trained network, or an ensemble of such networks, as the teacher. In this paper, we observe that an intermediate model, i.e., a checkpoint from the middle of the training procedure, often serves as a better teacher than the fully converged model, even though the former has much lower accuracy. More surprisingly, when used as teachers, a weak snapshot ensemble of several intermediate models from the same training trajectory can outperform a strong ensemble of independently trained, fully converged models.
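As a rough illustration of the setup the abstract describes, the sketch below forms a teacher distribution from several checkpoints of one training run and scores a student against it. The abstract does not specify the loss or how the snapshots are combined, so the temperature-softened KL divergence, the averaging of checkpoint probabilities, and all function names and logit values here are assumptions for illustration, not the paper's method.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def snapshot_ensemble_probs(checkpoint_logits, temperature=1.0):
    """Average the softened predictions of several checkpoints.

    Assumption: the snapshot ensemble is combined by averaging
    probabilities; the source does not say how it is formed.
    """
    probs = [softmax(logits, temperature) for logits in checkpoint_logits]
    n = len(probs)
    return [sum(p[i] for p in probs) / n for i in range(len(probs[0]))]


def distillation_loss(student_logits, teacher_probs, temperature=1.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Assumption: the common soft-target distillation objective.
    """
    student_probs = softmax(student_logits, temperature)
    kl = sum(
        t * math.log(t / s)
        for t, s in zip(teacher_probs, student_probs)
        if t > 0
    )
    return temperature ** 2 * kl


# Hypothetical logits from two intermediate checkpoints on one input.
checkpoint_logits = [[2.0, 1.0, 0.1], [3.0, 0.5, 0.2]]
teacher = snapshot_ensemble_probs(checkpoint_logits, temperature=4.0)

# Hypothetical student logits for the same input.
loss = distillation_loss([1.5, 0.8, 0.3], teacher, temperature=4.0)
print(loss)
```

In practice this term would be computed per batch and usually combined with a cross-entropy term on the ground-truth labels. Which checkpoints to select as teachers is the question the paper addresses, and the sketch does not attempt to answer it.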

efficient knowledge distillation, intermediate model, model checkpoint


Industry:
- Education (1.00)

Technology:
- Information Technology > Artificial Intelligence > Machine Learning (0.42)