
B.1 Knowledge Distillation

Knowledge Distillation (KD) [16] has played the most significant role in overcoming the performance degradation caused by model compression, since the smaller models (i.e., student models) can absorb the rich knowledge of the uncompressed ones (i.e., teacher models) [40, 25, 43, 14]. KD has been extensively applied to computer vision and NLP tasks [52] since its debut. For the second part, $A_i^S$ ($A_i^T$) is the attention matrix corresponding to the $i$-th head (in our setting, $h = 12$). In the final part, the dimension $c$ of the logit outputs ($p^S$ and $p^T$) is either 2 or 3 for the GLUE tasks. Here we explain in more detail: One-Stage KD means that we naively minimize the sum of the teacher-student differences on hidden states, attentions, and logits.
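The three-part objective that the passage refers to (its first, second, and final parts) is not reproduced in this extract. A minimal sketch of such a layer-wise objective, assuming a standard TinyBERT-style formulation in which hidden-state, attention, and logit differences are summed (the projection $W_h$ and the student-to-teacher layer mapping $m(\cdot)$ are notational assumptions, not taken from the paper):

$$
\mathcal{L}_{\mathrm{KD}}
= \underbrace{\sum_{l} \mathrm{MSE}\!\left(H^{S}_{l} W_h,\; H^{T}_{m(l)}\right)}_{\text{hidden states (first part)}}
+ \underbrace{\frac{1}{h} \sum_{i=1}^{h} \mathrm{MSE}\!\left(A^{S}_{i},\; A^{T}_{i}\right)}_{\text{attentions (second part)}}
- \underbrace{\sum_{j=1}^{c} p^{T}_{j} \log p^{S}_{j}}_{\text{logits (final part)}}
$$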

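For concreteness, the One-Stage KD objective described above can be sketched in PyTorch. This is a hedged illustration under the stated assumptions, not the paper's implementation: the function name `one_stage_kd_loss`, the learned projection `proj`, and the temperature-scaled soft cross-entropy on the logits are our additions.

```python
import torch
import torch.nn.functional as F

def one_stage_kd_loss(student_hidden, teacher_hidden,
                      student_attn, teacher_attn,
                      student_logits, teacher_logits,
                      proj, temperature=1.0):
    """Naively sum teacher-student differences on hidden states,
    attentions, and logits (One-Stage KD).

    student_hidden / teacher_hidden: lists of [batch, seq, dim] tensors,
      one per matched layer; `proj` maps the student dim to the teacher dim.
    student_attn / teacher_attn: lists of [batch, h, seq, seq] tensors
      (h = 12 heads in the setting above).
    *_logits: [batch, c] tensors, with c = 2 or 3 for the GLUE tasks.
    """
    # Hidden-state term: MSE after projecting student states.
    hidden_loss = sum(F.mse_loss(proj(h_s), h_t)
                      for h_s, h_t in zip(student_hidden, teacher_hidden))
    # Attention term: MSE over all heads' attention matrices.
    attn_loss = sum(F.mse_loss(a_s, a_t)
                    for a_s, a_t in zip(student_attn, teacher_attn))
    # Logit term: KL between temperature-scaled distributions p^S and p^T,
    # rescaled by T^2 as in standard soft-label distillation (assumed).
    logit_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean") * temperature ** 2
    return hidden_loss + attn_loss + logit_loss
```

In use, `proj` would typically be a `torch.nn.Linear(student_dim, teacher_dim)` trained jointly with the student, so that the MSE on hidden states is computed in the teacher's dimensionality.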