A Additional Related Work

KD has been extensively applied to computer vision and NLP tasks [ 52

Neural Information Processing Systems 

Examples of such knowledge definitions include output logits (e.g., DistilBERT [

We defined the three types of KD (1S-KD, 2S-KD, and 3S-KD) in the main text. Here we explain them in more detail. One-Stage KD means that we naively minimize the sum of the teacher-student differences on hidden states, attentions, and logits. Finally, Three-Stage KD inherits the properties of 1S-KD and 2S-KD.

See Table C.1 for the size of each (augmented)

Overall, we find that the standard deviations are within 0.1 on the GLUE score.
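The One-Stage KD objective described above can be sketched as a single summed loss. This is only an illustrative sketch under stated assumptions: the text says "differences" without naming the distance, so mean squared error for hidden states and attentions and KL divergence for logits are assumed here. The function names, the dictionary layout, and the temperature parameter are hypothetical, and teacher and student layers are assumed to be already matched in number and shape.

```python
import math


def mse(a, b):
    """Mean squared error between two equal-length flat lists of floats."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a flat list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as flat lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def one_stage_kd_loss(teacher, student, temperature=1.0):
    """Sum of teacher-student differences on hidden states, attentions, logits.

    `teacher` and `student` are dicts (hypothetical layout) with keys:
      "hidden": list of per-layer flat lists of floats
      "attn":   list of per-layer flat lists of floats
      "logits": flat list of floats
    The distance functions are assumptions, not taken from the text.
    """
    hidden_term = sum(
        mse(t, s) for t, s in zip(teacher["hidden"], student["hidden"])
    )
    attn_term = sum(
        mse(t, s) for t, s in zip(teacher["attn"], student["attn"])
    )
    logit_term = kl_divergence(
        softmax(teacher["logits"], temperature),
        softmax(student["logits"], temperature),
    )
    # One-stage: all three terms are minimized jointly as a plain sum.
    return hidden_term + attn_term + logit_term
```

In this sketch all three terms are optimized together from the start, which is what distinguishes the one-stage setting from a staged schedule that would enable the terms in separate phases.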