Does Knowledge Distillation Really Work?

Samuel Stanton, Pavel Izmailov, Polina Kirichenko, Alexander A. Alemi, Andrew Gordon Wilson

arXiv.org Machine Learning 

Large, deep networks can learn representations that generalize well. While smaller, more efficient networks lack the inductive biases to find these representations from training data alone, they may have the capacity to represent these solutions [e.g., 1, 16, 27, 39]. Influential work on knowledge distillation [19] argues that Bucilă et al. [4] "demonstrate convincingly that the knowledge acquired by a large ensemble of models [the teacher] can be transferred to a single small model [the student]". Indeed, this quote encapsulates the conventional narrative of knowledge distillation: a student model learns a high-fidelity representation of a larger teacher, enabled by the teacher's soft labels. However, in Figure 1 we show that with modern architectures knowledge distillation can lead to students with very different predictions from their teachers, even when the student has the capacity to match the teacher perfectly.
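For context, the "soft label" training referenced above is commonly implemented as the distillation objective of Hinton et al. [19]: a KL term between temperature-softened teacher and student predictions, mixed with ordinary cross-entropy on the ground-truth labels. The sketch below assumes a PyTorch setting; the function name, temperature T, and mixing weight alpha are illustrative choices, not values taken from the paper.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
        # Soft-label term: KL divergence between temperature-softened
        # teacher and student distributions. The T**2 factor keeps the
        # gradient scale comparable across temperatures, as in [19].
        soft = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T ** 2)
        # Hard-label term: standard cross-entropy on the true labels.
        hard = F.cross_entropy(student_logits, labels)
        return alpha * soft + (1 - alpha) * hard

Under this view, a "high-fidelity" student would be one whose softened predictions closely track the teacher's on held-out data; the paper's point is that minimizing this objective in practice does not guarantee that outcome.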
