
Does Knowledge Distillation Really Work?

Here we briefly describe the key implementation details needed to reproduce our experiments. Data augmentation details are given in A.1, architecture details in A.2, and training details in A.3. The reader is encouraged to consult the included code for closer inspection.

A.1 Data augmentation procedures

Some of the data augmentation procedures we consider attempt to generate data close to the training data distribution (standard augmentations, GAN, mixup). Others (random noise, out-of-domain data) produce distillation data that the teacher would never encounter during normal supervised training.
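To make the two families concrete, the following is a minimal PyTorch sketch of generating distillation inputs via mixup (near the training distribution) and random noise (far from it). The function names (mixup_inputs, noise_inputs) and parameters (alpha, std) are our own illustrative choices, not the paper's implementation; consult the included code for the exact procedures used in the experiments.

    import torch

    def mixup_inputs(x, alpha=1.0):
        # Convex combination of each image with a randomly chosen partner
        # in the batch (Zhang et al., 2018). Labels are not mixed here:
        # distillation targets come from the teacher's predictions on the
        # mixed inputs.
        lam = torch.distributions.Beta(alpha, alpha).sample().item()
        perm = torch.randperm(x.size(0))
        return lam * x + (1.0 - lam) * x[perm]

    def noise_inputs(x, std=1.0):
        # Pure Gaussian noise matching the batch's shape: inputs the
        # teacher never encounters during normal supervised training.
        return torch.randn_like(x) * std

    # Hypothetical usage, where x is a batch drawn from the train loader:
    # mixed = mixup_inputs(x)   # near-distribution distillation inputs
    # noisy = noise_inputs(x)   # far-from-distribution distillation inputs

Note that for distillation only the inputs need mixing: the student is trained to match the teacher's outputs on the generated data, so no label interpolation is required.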
