Does Knowledge Distillation Really Work?

Samuel Stanton, Pavel Izmailov, Polina Kirichenko, Alexander A. Alemi, Andrew Gordon Wilson

arXiv.org Machine Learning 

Large, deep networks can learn representations that generalize well. While smaller, more efficient networks lack the inductive biases to find these representations from training data alone, they may have the capacity to represent these solutions [e.g., 1, 16, 27, 39]. Influential work on knowledge distillation [19] argues that Bucilă et al. [4] "demonstrate convincingly that the knowledge acquired by a large ensemble of models [the teacher] can be transferred to a single small model [the student]". Indeed, this quote encapsulates the conventional narrative of knowledge distillation: a student model learns a high-fidelity representation of a larger teacher, enabled by the teacher's soft labels. However, in Figure 1 we show that with modern architectures knowledge distillation can lead to students with very different predictions from their teachers, even when the student has the capacity to match the teacher perfectly.
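For context, the "soft label" training referenced above is commonly implemented as the distillation objective of Hinton et al. [19]: a KL term between temperature-softened teacher and student predictions, mixed with ordinary cross-entropy on the ground-truth labels. The sketch below assumes a PyTorch setting; the function name, temperature T, and mixing weight alpha are illustrative choices, not values taken from the paper.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
        # Soft-label term: KL divergence between temperature-softened
        # teacher and student distributions. The T**2 factor keeps the
        # gradient scale comparable across temperatures, as in [19].
        soft = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T ** 2)
        # Hard-label term: standard cross-entropy on the true labels.
        hard = F.cross_entropy(student_logits, labels)
        return alpha * soft + (1 - alpha) * hard

Under this view, a "high-fidelity" student would be one whose softened predictions closely track the teacher's on held-out data; the paper's point is that minimizing this objective in practice does not guarantee that outcome.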
