Teach Less, Learn More: On the Undistillable Classes in Knowledge Distillation

Neural Information Processing Systems

Knowledge distillation (KD) can effectively compress neural networks by training a smaller network (the student) to mimic the behavior of a larger one (the teacher). A counter-intuitive observation is that a larger teacher does not necessarily make a better student, but the reasons for this phenomenon remain unclear. In this paper, we demonstrate that it is directly attributable to the presence of undistillable classes: when trained with distillation, the teacher's knowledge of some classes is incomprehensible to the student model. We observe that while KD improves overall accuracy, it comes at the cost of the student becoming less accurate on these undistillable classes. After establishing their widespread existence in state-of-the-art distillation methods, we illustrate their correlation with the capacity gap between the teacher and student models. Finally, we present a simple Teach Less, Learn More (TLLM) framework to identify and discard the undistillable classes during training.
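
For a concrete picture, the sketch below shows (i) a standard temperature-scaled KD loss and (ii) one plausible way to realize the "identify and discard" idea: flag classes on which the distilled student is less accurate than a non-distilled baseline, then drop the teacher's signal for samples of those classes. This is a minimal sketch under assumed conventions, not the paper's actual TLLM implementation; the names kd_loss, find_undistillable_classes, class_mask, and the flagging criterion are illustrative assumptions.

```python
# Hedged sketch of KD with a per-class "distill / don't distill" mask.
# Function names and the masking criterion are assumptions, not from the paper.
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, targets, T=4.0, alpha=0.9, class_mask=None):
    """Cross-entropy plus temperature-scaled KL to the teacher.

    class_mask: optional bool tensor [num_classes]; samples whose target class is
    masked out (undistillable) receive only the cross-entropy term.
    """
    ce = F.cross_entropy(student_logits, targets, reduction="none")          # [B]
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="none",
    ).sum(dim=1) * (T * T)                                                   # [B]
    if class_mask is not None:
        distill = class_mask[targets].float()   # 1.0 = keep the teacher's signal
    else:
        distill = torch.ones_like(ce)
    loss = (1.0 - alpha * distill) * ce + alpha * distill * kl
    return loss.mean()

def find_undistillable_classes(acc_kd_per_class, acc_baseline_per_class, margin=0.0):
    """Flag classes where the distilled student is less accurate than a student
    trained without distillation (one plausible criterion, not the paper's)."""
    undistillable = acc_kd_per_class < (acc_baseline_per_class - margin)
    return ~undistillable   # mask: True = keep distilling this class
```

In practice one would estimate the per-class accuracies on a validation split, recompute the mask periodically during training, and pass it to kd_loss; the margin and the update schedule would be additional hyper-parameters.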


A Limitations and Potential Negative Social Impacts

Neural Information Processing Systems

Our work investigates the "larger teacher, worse student" phenomenon in knowledge distillation. However, we only discuss image classification; therefore, we do not guarantee the validity of our observations on other tasks, e.g., object detection. In addition, these classes can be sensitive, e.g., related to gender, and we hope future work can completely resolve this issue. Since most of these methods provide hyper-parameters for CIFAR-100, we do not modify them. In Section 2.2 we use a modified ResNet-24 as the student to perform KD with a ResNet-56 teacher model. We have noted that undistillable classes exist across various methods, and Table 1 gives a comprehensive list of the methods we studied.


