 cutmix



Appendix A Further Empirical Studies

Neural Information Processing Systems

As reported in Table A3, PS-MT consistently shows lower distances than Dual Teacher. The standard deviation is likewise between 2 and more than 50 times smaller. PS-MT's teachers, although they may have distinct characteristics, potentially end up at similar distances from the student at each epoch. Comparative analysis of performance based on different CutMix variations. We further report additional quantitative results encompassing three different splits: the original high-quality set, the blended set, and the blended high-quality set.



What Makes a "Good" Data Augmentation in Knowledge Distillation - A Statistical Perspective

Neural Information Processing Systems

Knowledge distillation (KD) is a general neural network training approach that uses a teacher model to guide the student model. Existing works mainly study KD from the network output side (e.g., trying to design a better KD loss function), while few have attempted to understand it from the input side. In particular, its interplay with data augmentation (DA) is not well understood. In this paper, we ask: Why do some DA schemes (e.g., CutMix) inherently perform much better than others in KD? What makes a "good" DA in KD? Our investigation from a statistical perspective suggests that a good DA scheme should reduce the covariance of the teacher-student cross-entropy.
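
To make the quantity named in this abstract concrete, the following is a minimal sketch of a per-sample teacher-student cross-entropy, assuming both models expose raw logits. The function names and the `temperature` parameter are illustrative assumptions, not taken from the paper, and the paper's covariance analysis itself is not reproduced here.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Row-wise softmax with an optional temperature (assumed parameter)."""
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def teacher_student_cross_entropy(teacher_logits, student_logits, temperature=1.0):
    """Cross-entropy H(p_teacher, p_student) for each sample in the batch.

    Both inputs have shape (batch, num_classes); the result has shape (batch,).
    """
    p_teacher = softmax(teacher_logits, temperature)
    log_p_student = np.log(softmax(student_logits, temperature))
    return -(p_teacher * log_p_student).sum(axis=1)
```

Evaluating this on batches produced by different augmentation schemes gives the per-sample values whose statistics the abstract refers to; how those statistics are estimated is left to the paper.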


RecursiveMix: Mixed Learning with History

Neural Information Processing Systems

Mix-based augmentation has been proven fundamental to the generalization of deep vision models. However, current augmentations only mix samples from the current data batch during training, which ignores the possible knowledge accumulated in the learning history. In this paper, we propose a recursive mixed-sample learning paradigm, termed "RecursiveMix" (RM), by exploring a novel training strategy that leverages the historical input-prediction-label triplets. More specifically, we iteratively resize the input image batch from the previous iteration and paste it into the current batch while their labels are fused proportionally to the area of the operated patches. Furthermore, a consistency loss is introduced to align the identical image semantics across the iterations, which helps the learning of scale-invariant feature representations. Based on ResNet-50, RM largely improves classification accuracy by ~3.2% on CIFAR-100 and ~2.8% on ImageNet with negligible extra computation/storage costs. In the downstream object detection task, the RM-pretrained model outperforms the baseline by 2.1 AP points and surpasses CutMix by 1.4 AP points under the ATSS detector on COCO.
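
The resize-and-paste step described in this abstract can be sketched roughly as below. This is an illustration under stated assumptions, not the authors' implementation: the helper names, the nearest-neighbour resize, the fixed paste corner (`top`, `left`), and the `lam` area parameter are all hypothetical, and the consistency loss is omitted.

```python
import numpy as np

def resize_nearest(batch, out_h, out_w):
    """Nearest-neighbour resize of a batch shaped (N, H, W, C)."""
    _, h, w, _ = batch.shape
    rows = np.arange(out_h) * h // out_h  # source row for each output row
    cols = np.arange(out_w) * w // out_w  # source column for each output column
    return batch[:, rows][:, :, cols]

def recursive_mix_step(images, labels, prev_images, prev_labels,
                       lam, top=0, left=0):
    """Paste a shrunken copy of the previous batch into the current batch.

    images, prev_images: float arrays shaped (N, H, W, C)
    labels, prev_labels: soft labels shaped (N, num_classes)
    lam: target fraction of the image area given to the historical patch
         (assumed parameter; the caller must keep the patch inside the image)
    """
    _, h, w, _ = images.shape
    patch_h = max(1, int(round(h * np.sqrt(lam))))
    patch_w = max(1, int(round(w * np.sqrt(lam))))
    patch = resize_nearest(prev_images, patch_h, patch_w)

    mixed = images.copy()
    mixed[:, top:top + patch_h, left:left + patch_w] = patch

    # Fuse labels in proportion to the area actually covered by the patch.
    area = (patch_h * patch_w) / (h * w)
    mixed_labels = (1.0 - area) * labels + area * prev_labels
    return mixed, mixed_labels
```

Feeding the returned `mixed` and `mixed_labels` back in as `prev_images` and `prev_labels` on the next iteration gives the recursive behaviour: each new batch carries a progressively smaller trace of earlier batches.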