da669dfd3c36c93905a17ddba01eef06-Supplemental-Conference.pdf

Feb-12-2026, 06:26:36 GMT–Neural Information Processing Systems

As shown in Table 10, adopting DIST with Pearson correlation achieves higher accuracies than KD and DIST with cosine similarity, especially when the teacher and student are trained with label smoothing (which shifts the predicted probability distributions).

The speed is tested based on our implementations on 8 NVIDIA V100 GPUs.

Method	KD [16]	RKD [30]	SRRL [47]	CRD [41]	DIST
Speed	14.28	11.11	12.98	8.33	14.19

A.5 Landscapes of matching functions

As discussed in our main text, matching functions such as KL divergence and MSE are used to match the outputs between student and teacher in KD.

DIST can consistently improve multi-class classification in various tasks such as classification, object detection, and semantic segmentation. However, it would be less effective on binary classification tasks, as the task only contains two classes and the information in inter-class relations is limited.
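The two claims above (that a shift in the predicted distribution affects cosine similarity but not Pearson correlation, and that two-class outputs carry little inter-class information) can be checked numerically. The following is a minimal sketch in plain Python, not the authors' implementation: the helper names `cosine`, `pearson`, and `shift_toward_uniform` are hypothetical, and mixing a prediction with the uniform distribution is only an assumed stand-in for the shift that label smoothing induces.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (non-zero norms assumed)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pearson(u, v):
    """Pearson correlation: cosine similarity of the mean-centered vectors.

    Assumes neither vector is constant, otherwise the norm is zero.
    """
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    centered_u = [a - mean_u for a in u]
    centered_v = [b - mean_v for b in v]
    return cosine(centered_u, centered_v)


def shift_toward_uniform(p, eps):
    """Hypothetical stand-in for a label-smoothing-induced shift:
    scale the prediction by (1 - eps) and add eps / num_classes to each entry."""
    k = len(p)
    return [(1.0 - eps) * a + eps / k for a in p]


# Three-class prediction and its shifted counterpart.
teacher = [0.7, 0.2, 0.1]
shifted = shift_toward_uniform(teacher, eps=0.1)

# Pearson ignores the scale-and-shift; cosine does not.
pearson_shift = pearson(teacher, shifted)
cosine_shift = cosine(teacher, shifted)

# Two-class case: each prediction is (p, 1 - p), so the centered vectors are
# always multiples of (1, -1) and the correlation collapses to +1 or -1.
binary_agree = pearson([0.9, 0.1], [0.6, 0.4])
binary_disagree = pearson([0.9, 0.1], [0.4, 0.6])

print(pearson_shift, cosine_shift, binary_agree, binary_disagree)
```

Under these assumptions, the Pearson value for the shifted pair stays at 1 up to floating-point error while the cosine value drops below 1. In the binary case the correlation only records whether the two predictions prefer the same class, which is one way to read the remark that inter-class relation information is limited there.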

artificial intelligence, dist, machine learning

Industry:
- Education (0.75)

Technology:
- Information Technology > Artificial Intelligence > Machine Learning (1.00)

Title
A Appendix A.1 Implementation of DIST