Appendix A.1 Implementation of DIST
–Neural Information Processing Systems
This section presents the implementation code of DIST, as shown in Figure 4. The purpose of these methods is to learn the similarity relationships between instances from the teacher, e.g., the semantic spaces of instances with the same KD methods, which helps achieve better performance, especially when the student is trained with a stronger teacher. Here we conduct experiments to investigate the efficacy of our method with cosine similarity. As discussed in the main text, matching functions such as KL divergence and MSE are used in KD to match the outputs of the student and the teacher.
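To make the difference between these matching functions concrete, the following is a minimal, framework-free sketch. It is not the code from Figure 4. The helper names, the toy logits, and the mixing weight `alpha` are illustrative assumptions. Pearson correlation is included here as the mean-centered form of cosine similarity. The toy student output is an affine transform of the teacher output, so it keeps the teacher's relative preferences between classes while differing in exact values.

```python
import math


def softmax(logits, tau=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / tau for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(teacher, student, eps=1e-12):
    """KL(teacher || student) for two probability vectors."""
    return sum(t * math.log((t + eps) / (s + eps))
               for t, s in zip(teacher, student))


def mse(teacher, student):
    """Mean squared error between two equally sized vectors."""
    return sum((t - s) ** 2 for t, s in zip(teacher, student)) / len(teacher)


def cosine_similarity(a, b, eps=1e-12):
    """Cosine of the angle between a and b (invariant to positive scaling)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + eps)


def pearson_correlation(a, b, eps=1e-12):
    """Cosine similarity of the mean-centered vectors
    (invariant to positive scaling and to shifts)."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return cosine_similarity([x - mean_a for x in a],
                             [y - mean_b for y in b], eps)


# Toy teacher prediction over three classes (hypothetical logits).
teacher = softmax([2.0, 1.0, 0.1])

# Hypothetical student: the teacher's distribution mixed with a uniform one.
# This is an affine transform of the teacher output and still sums to 1.
alpha = 0.5
student = [alpha * t + (1.0 - alpha) / len(teacher) for t in teacher]

kl = kl_divergence(teacher, student)
err = mse(teacher, student)
cos = cosine_similarity(teacher, student)
rho = pearson_correlation(teacher, student)

print(f"KL divergence:       {kl:.4f}")
print(f"MSE:                 {err:.4f}")
print(f"Cosine similarity:   {cos:.4f}")
print(f"Pearson correlation: {rho:.4f}")
```

In this sketch, KL divergence and MSE both penalize the student because its probabilities differ from the teacher's in value. Cosine similarity is below 1 because it ignores scale but not shift. Pearson correlation treats the two outputs as a perfect match, since it only measures whether the relation between classes is preserved.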
Aug-19-2025, 09:01:25 GMT