Revisit the Power of Vanilla Knowledge Distillation: from Small Scale to Large Scale Supplementary Material

Neural Information Processing Systems 

A.1 Details of "stronger recipe"

In Table 1 of our main paper, we evaluate the impact of limited model capacity [1] and small-scale datasets by comparing the results obtained with the "previous training recipe" and with our "stronger recipe". The details of the "stronger recipe" are summarized in Table 13.

Table 13: Stronger training strategy used for distillation. "B" and "C" represent strategies for training students on ImageNet-1K and CIFAR100, respectively.

A.2 Numerical results

In Figure 1 of our main paper, we compare the performance gaps between vanilla KD and two logits-based baselines, i.e., DKD [2] and DIST [3], on two datasets of varying scales, to demonstrate that vanilla KD is underestimated on small-scale datasets.
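For readers who want a concrete picture of the "vanilla KD" objective being compared here, it is commonly written as the temperature-scaled KL divergence between the teacher's and the student's softmax outputs. The sketch below is a minimal plain-Python illustration of that standard formulation, not the authors' implementation: the function names, the default temperature, and the T^2 scaling convention are assumptions, and the paper's actual loss weighting and hyperparameters are not given in this section.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_sum_exp for s in scaled]


def vanilla_kd_loss(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student) on softened distributions for one sample.

    Hypothetical helper for illustration only. The T^2 factor is the
    usual convention for keeping gradient magnitudes comparable across
    temperatures; it is an assumption here, not taken from the paper.
    """
    log_p_teacher = log_softmax(teacher_logits, temperature)
    log_p_student = log_softmax(student_logits, temperature)
    kl = sum(
        math.exp(lt) * (lt - ls)
        for lt, ls in zip(log_p_teacher, log_p_student)
    )
    return (temperature ** 2) * kl


if __name__ == "__main__":
    teacher = [2.0, 0.5, -1.0]
    student = [1.0, 1.0, 0.0]
    print(vanilla_kd_loss(student, teacher, temperature=4.0))
```

In a full training loop this term would typically be averaged over a batch and combined with the cross-entropy loss on the ground-truth labels; how the two are weighted is a design choice that this sketch leaves open.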
