A Additional Related Work

KD has been extensively applied to computer vision and NLP tasks [52–
Examples of such knowledge definitions include output logits (e.g., DistilBERT [

We defined the three types of KD (1S-KD, 2S-KD, and 3S-KD) in the main text. Here we explain them in more detail. One-Stage KD means we naively minimize the sum of teacher-student differences on hidden states, attentions, and logits. Finally, Three-Stage KD inherits the properties of 1S-KD and 2S-KD. See Table C.1 for the size of each (augmented)

Overall, we find that the standard deviation is within 0.1 on the GLUE score.
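The One-Stage KD objective described above (a plain sum of teacher-student differences on hidden states, attentions, and logits) can be illustrated with a minimal sketch. The excerpt does not say which distance is used for each term, so the choices here are assumptions: mean squared error for hidden states and attentions, a soft cross-entropy for logits, and equal weights. The names `one_stage_kd_loss`, `mse`, and `soft_cross_entropy`, and the dictionary layout, are hypothetical, and the sketch assumes student and teacher layers are already matched and have equal sizes.

```python
import math

def mse(a, b):
    """Mean squared error between two equally sized flat lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_cross_entropy(student_logits, teacher_logits, temperature=1.0):
    """Cross-entropy of the student distribution against the teacher's soft targets."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))

def one_stage_kd_loss(student, teacher, temperature=1.0):
    """Sum all three teacher-student differences in a single objective.

    Assumption: equal weights and MSE / soft cross-entropy distances;
    the source only states that the differences are summed.
    """
    hidden = sum(mse(s, t) for s, t in zip(student["hidden"], teacher["hidden"]))
    attn = sum(mse(s, t) for s, t in zip(student["attn"], teacher["attn"]))
    logit = soft_cross_entropy(student["logits"], teacher["logits"], temperature)
    return hidden + attn + logit

# Toy example: one matched layer, flattened hidden states and attention maps.
teacher = {"hidden": [[0.5, -1.0]], "attn": [[0.7, 0.3]], "logits": [0.0, 0.0]}
student = {"hidden": [[0.0, -1.0]], "attn": [[0.6, 0.4]], "logits": [1.0, 0.0]}
loss = one_stage_kd_loss(student, teacher)
```

With this formulation, a student that matches the teacher exactly still has a nonzero loss equal to the entropy of the teacher's output distribution, because the logit term is a cross-entropy and not a KL divergence.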
Neural Information Processing Systems
Nov-13-2025, 10:32:16 GMT