A Appendix

A.1 Implementation of DIST

Neural Information Processing Systems 

This section presents the implementation of DIST, shown in Figure 4. The method aims to learn the similarity relationships between instances from the teacher, e.g., the semantic spaces of instances under the same KD method, which helps achieve better performance, especially when the student is trained with a stronger teacher. We also conduct experiments to investigate the efficacy of our method with cosine similarity. As discussed in the main text, KD uses matching functions such as KL divergence and MSE to match the outputs of the student and the teacher.
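To make the difference between these matching functions concrete, the following is a minimal sketch in plain Python. It is not the code of Figure 4: the function names (`softmax`, `kl_divergence`, `mse`, `cosine_distance`, `relation_loss`), the temperature parameter `tau`, and the toy logits are all assumptions introduced for illustration. It contrasts exact matching (KL divergence, MSE) with a relation-style match based on cosine similarity, applied both per instance (across classes) and per class (across instances).

```python
import math


def softmax(logits, tau=1.0):
    """Temperature-scaled softmax over one instance's class logits."""
    scaled = [z / tau for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p_teacher, p_student, eps=1e-12):
    """KL(teacher || student): zero only when the two distributions coincide."""
    return sum(t * math.log((t + eps) / (s + eps))
               for t, s in zip(p_teacher, p_student))


def mse(p_teacher, p_student):
    """Mean squared error between two output vectors."""
    return sum((t - s) ** 2 for t, s in zip(p_teacher, p_student)) / len(p_teacher)


def cosine_distance(u, v, eps=1e-12):
    """1 - cosine similarity: zero whenever u and v point in the same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v + eps)


def relation_loss(student_probs, teacher_probs):
    """Hypothetical relation-style loss on a batch of probability vectors.

    Returns (inter, intra): the mean cosine distance over rows (one row per
    instance, across classes) and over columns (one column per class,
    across instances).
    """
    inter = sum(cosine_distance(s, t)
                for s, t in zip(student_probs, teacher_probs)) / len(student_probs)
    s_cols = list(zip(*student_probs))  # transpose: one tuple per class
    t_cols = list(zip(*teacher_probs))
    intra = sum(cosine_distance(s, t)
                for s, t in zip(s_cols, t_cols)) / len(s_cols)
    return inter, intra


# Toy batch: 2 instances, 3 classes (made-up logits).
teacher_logits = [[4.0, 1.0, 0.0], [0.5, 3.0, 1.0]]
student_logits = [[2.0, 0.5, 0.0], [0.2, 1.5, 0.5]]
teacher = [softmax(z) for z in teacher_logits]
student = [softmax(z) for z in student_logits]

inter, intra = relation_loss(student, teacher)
exact_kl = sum(kl_divergence(t, s) for t, s in zip(teacher, student)) / len(teacher)
```

In this sketch, the cosine-based terms vanish whenever the student's vector is a positive rescaling of the teacher's, whereas KL divergence and MSE vanish only when the two outputs are identical. Matching relations is in this sense a looser requirement than matching exact values.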
