What Makes a "Good" Data Augmentation in Knowledge Distillation - A Statistical Perspective (with Appendix)
Knowledge distillation (KD) is a general neural network training approach that uses a teacher model to guide the student model. Existing works mainly study KD from the network output side (e.g., trying to design a better KD loss function), while few have attempted to understand it from the input side. In particular, its interplay with data augmentation (DA) has not been well understood. In this paper, we ask: Why do some DA schemes (e.g., CutMix) inherently perform much better than others in KD? What makes a "good" DA in KD? Our investigation from a statistical perspective suggests that a good DA scheme should reduce the covariance of the teacher-student cross-entropy.
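The following is a minimal, hypothetical sketch (not the paper's own metric or code) of how one might probe the abstract's criterion empirically: estimate the spread of the teacher-student cross-entropy over repeated draws of a stochastic augmentation, with the expectation that a "good" DA yields a smaller spread. The toy models, the `random_erase` augmentation, and all function names here are illustrative assumptions, not taken from the paper.

```python
# Hypothetical sketch: estimate the variance of the teacher-student cross-entropy
# over the stochasticity of a data augmentation. Per the abstract's claim, a "good"
# DA should keep this quantity small. Models and augmentation are placeholders.
import torch
import torch.nn.functional as F


def ts_cross_entropy(teacher, student, x):
    """Per-sample teacher-student cross-entropy H(p_teacher, p_student)."""
    with torch.no_grad():
        p_t = F.softmax(teacher(x), dim=1)       # teacher probabilities
    log_p_s = F.log_softmax(student(x), dim=1)   # student log-probabilities
    return -(p_t * log_p_s).sum(dim=1)           # shape: (batch,)


def augmentation_variance(teacher, student, x, augment, n_draws=32):
    """Draw the augmentation n_draws times; return the mean per-sample variance
    of the teacher-student cross-entropy across those draws."""
    losses = torch.stack([ts_cross_entropy(teacher, student, augment(x))
                          for _ in range(n_draws)])   # (n_draws, batch)
    return losses.var(dim=0).mean().item()


if __name__ == "__main__":
    # Toy linear models and a random-erasing-style augmentation, for illustration only.
    teacher = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
    student = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
    x = torch.randn(8, 3, 32, 32)

    def random_erase(imgs, size=8):
        out = imgs.clone()
        h = torch.randint(0, imgs.shape[2] - size, (1,)).item()
        w = torch.randint(0, imgs.shape[3] - size, (1,)).item()
        out[:, :, h:h + size, w:w + size] = 0.0
        return out

    print("TS cross-entropy variance under random erasing:",
          augmentation_variance(teacher, student, x, random_erase))
```

In practice one would compare this estimate across candidate DA schemes (e.g., CutMix vs. simple flipping) using the actual pretrained teacher and student; the paper's full definition of the criterion is given in the text and appendix.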
Neural Information Processing Systems
May-29-2025, 23:32:39 GMT
- Country:
- North America > United States > Massachusetts (0.14)
- Genre:
- Research Report (1.00)
- Industry:
- Education (0.66)