What Makes a "Good" Data Augmentation in Knowledge Distillation - A Statistical Perspective (with Appendix)
Knowledge distillation (KD) is a general neural network training approach that uses a teacher model to guide the student model. Existing works mainly study KD from the network output side (e.g., trying to design a better KD loss function), while few have attempted to understand it from the input side. In particular, its interplay with data augmentation (DA) has not been well understood. In this paper, we ask: Why do some DA schemes (e.g., CutMix) inherently perform much better than others in KD? What makes a "good" DA in KD? Our investigation from a statistical perspective suggests that a good DA scheme should reduce the covariance of the teacher-student cross-entropy.
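The following is a minimal, hypothetical sketch (not the paper's own metric or code) of how one might probe the abstract's criterion empirically: estimate the spread of the teacher-student cross-entropy over repeated draws of a stochastic augmentation, with the expectation that a "good" DA yields a smaller spread. The toy models, the `random_erase` augmentation, and all function names here are illustrative assumptions, not taken from the paper.

```python
# Hypothetical sketch: estimate the variance of the teacher-student cross-entropy
# over the stochasticity of a data augmentation. Per the abstract's claim, a "good"
# DA should keep this quantity small. Models and augmentation are placeholders.
import torch
import torch.nn.functional as F


def ts_cross_entropy(teacher, student, x):
    """Per-sample teacher-student cross-entropy H(p_teacher, p_student)."""
    with torch.no_grad():
        p_t = F.softmax(teacher(x), dim=1)       # teacher probabilities
    log_p_s = F.log_softmax(student(x), dim=1)   # student log-probabilities
    return -(p_t * log_p_s).sum(dim=1)           # shape: (batch,)


def augmentation_variance(teacher, student, x, augment, n_draws=32):
    """Draw the augmentation n_draws times; return the mean per-sample variance
    of the teacher-student cross-entropy across those draws."""
    losses = torch.stack([ts_cross_entropy(teacher, student, augment(x))
                          for _ in range(n_draws)])   # (n_draws, batch)
    return losses.var(dim=0).mean().item()


if __name__ == "__main__":
    # Toy linear models and a random-erasing-style augmentation, for illustration only.
    teacher = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
    student = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
    x = torch.randn(8, 3, 32, 32)

    def random_erase(imgs, size=8):
        out = imgs.clone()
        h = torch.randint(0, imgs.shape[2] - size, (1,)).item()
        w = torch.randint(0, imgs.shape[3] - size, (1,)).item()
        out[:, :, h:h + size, w:w + size] = 0.0
        return out

    print("TS cross-entropy variance under random erasing:",
          augmentation_variance(teacher, student, x, random_erase))
```

In practice one would compare this estimate across candidate DA schemes (e.g., CutMix vs. simple flipping) using the actual pretrained teacher and student; the paper's full definition of the criterion is given in the text and appendix.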
Neural Information Processing Systems
May-29-2025, 23:32:39 GMT
- Country:
- North America > United States > Massachusetts (0.14)
- Genre:
- Research Report (1.00)
- Industry:
- Education (0.66)