da669dfd3c36c93905a17ddba01eef06-Supplemental-Conference.pdf

Neural Information Processing Systems 

As shown in Table 10, adopting DIST with Pearson correlation achieves higher accuracies than KD and DIST with cosine similarity, especially when the teacher and student are trained with label smoothing (which shifts the predicted probability distributions).

The speed is tested based on our implementations on 8 NVIDIA V100 GPUs.

Method   KD [16]   RKD [30]   SRRL [47]   CRD [41]   DIST
Speed    14.28     11.11      12.98       8.33       14.19

A.5 Landscapes of matching functions

As discussed in our main text, matching functions such as KL divergence and MSE are used to match the outputs between student and teacher in KD.

DIST can consistently improve multi-class classification in various tasks such as classification, object detection, and semantic segmentation. However, it would be less effective on binary classification tasks, as such a task contains only two classes and the information in the inter-class relation is limited.
