One-for-All: Bridge the Gap of Heterogeneous Architectures in Knowledge Distillation Supplementary Material

Neural Information Processing Systems 

As the current best MLP-based model achieves a top-1 accuracy of only 83.8%, we instead employ a ViT-B model with a top-1 accuracy of 86.53% as the teacher model for cross-architecture distillation. Our OFA-KD framework yields significant top-1 accuracy improvements over models trained from scratch, ranging from 1.2% to 2.6%. For the CKA analysis, we select a batch of 128 samples from the ImageNet-1K validation set and collect model activations after each activation layer, as sketched below. Further CKA results are provided in Figure 6.
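To make the analysis procedure concrete, the following is a minimal sketch of how activations can be collected with forward hooks and compared with linear CKA (Kornblith et al., 2019). It assumes PyTorch; the helper names `collect_activations` and `linear_cka`, and the choice of `nn.GELU` as the activation layer to hook, are illustrative assumptions rather than the paper's exact implementation, which may use a different CKA variant or layer selection.

```python
import torch
import torch.nn as nn


def collect_activations(model: nn.Module, inputs: torch.Tensor,
                        layer_type=nn.GELU) -> list[torch.Tensor]:
    """Record the flattened output of every activation layer via forward hooks.

    NOTE: hooking nn.GELU is an assumption; swap in the activation type
    actually used by the architecture under analysis (e.g. nn.ReLU).
    """
    feats, handles = [], []

    def hook(_module, _inp, out):
        # Flatten to (batch, features) so layers of any shape are comparable.
        feats.append(out.detach().flatten(start_dim=1))

    for m in model.modules():
        if isinstance(m, layer_type):
            handles.append(m.register_forward_hook(hook))
    with torch.no_grad():
        model(inputs)
    for h in handles:
        h.remove()
    return feats


def linear_cka(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Linear CKA between activation matrices of shape (n_samples, features)."""
    # Center each feature dimension over the batch.
    x = x - x.mean(dim=0, keepdim=True)
    y = y - y.mean(dim=0, keepdim=True)
    # CKA(X, Y) = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = torch.linalg.norm(y.t() @ x) ** 2
    norm_x = torch.linalg.norm(x.t() @ x)
    norm_y = torch.linalg.norm(y.t() @ y)
    return cross / (norm_x * norm_y)
```

Given the same batch of 128 validation images fed to a teacher and a student, computing `linear_cka(teacher_feats[i], student_feats[j])` over all layer pairs produces the kind of layer-wise similarity heatmap shown in Figure 6.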
