Knowledge Distillation Performs Partial Variance Reduction

Mher Safaryan, Alexandra Peste, Dan Alistarh

arXiv.org Artificial Intelligence 

Knowledge Distillation (KD) [13, 3] is a standard tool for transferring information between a machine learning model of lower representational capacity, usually called the student, and a more accurate and powerful teacher model. In the context of classification using neural networks, it is common for the student to be a smaller network [2], whereas the teacher is a network that is larger and more computationally expensive, but also more accurate. Assuming a supervised classification task, distillation consists of training the student to minimize the cross-entropy with respect to the teacher's logits on every given sample, in addition to minimizing the standard cross-entropy loss with respect to the ground-truth labels (a sketch of this combined objective is given below).

Since its introduction [3], distillation has been developed and applied in a wide variety of settings, from obtaining compact, high-accuracy encodings of model ensembles [13], to boosting the accuracy of compressed models [49, 38, 31], to reinforcement learning [50, 42, 35, 5, 7, 45] and learning with privileged information [51]. Given its apparent simplicity, there has been significant interest in finding explanations for the effectiveness of distillation [2, 13, 37]. For instance, one hypothesis [2, 13] is that the smoothed labels resulting from distillation present the student with a decision surface that is easier to learn than the one presented by the categorical (one-hot) outputs. Another hypothesis [2, 13, 51] starts from the observation that the teacher's outputs have higher entropy than the ground-truth labels, and therefore higher information content. Despite this work, we still have a limited analytical understanding of why knowledge distillation is so effective [37].
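To make the combined training objective concrete, the following is a minimal PyTorch-style sketch of a standard distillation loss. The temperature T, the mixing weight alpha, and the function name are illustrative assumptions, not values or notation taken from the paper; the KL-divergence term used here differs from the cross-entropy against the teacher's softened outputs only by a term constant in the student, so it yields the same gradients.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Combine the supervised loss with a distillation term.

    T (temperature) and alpha (mixing weight) are illustrative defaults,
    not values taken from the paper.
    """
    # Standard cross-entropy against the ground-truth (one-hot) labels.
    ce = F.cross_entropy(student_logits, labels)
    # Match the student's softened distribution to the teacher's.
    # The T**2 factor rescales the gradients of the softened term.
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T ** 2)
    return (1.0 - alpha) * ce + alpha * kd

# Example usage with random tensors (batch of 8, 10 classes):
student_logits = torch.randn(8, 10, requires_grad=True)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```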