Is Label Smoothing Truly Incompatible with Knowledge Distillation: An Empirical Study
Shen, Zhiqiang, Liu, Zechun, Xu, Dejia, Chen, Zitian, Cheng, Kwang-Ting, Savvides, Marios
arXiv.org Artificial Intelligence
This work aims to empirically clarify the recently raised claim that label smoothing is incompatible with knowledge distillation (Müller et al., 2019). We begin by introducing the motivation behind this claimed incompatibility, namely that label smoothing erases the relative information among teacher logits. We then provide a novel view of how label smoothing affects the distributions of semantically similar and dissimilar classes, and propose a metric to quantitatively measure the degree of erased information in a sample's representation. We next examine the one-sidedness and limitations of the incompatibility view through extensive analyses, visualizations, and comprehensive experiments on Image Classification, Binary Networks, and Neural Machine Translation. Finally, we broadly discuss several circumstances under which label smoothing does indeed lose its effectiveness.

Recently, a growing body of work has explored the underlying relationship between these two methods. For instance, Müller et al. (2019) found that label smoothing implicitly improves calibration but hurts the effectiveness of knowledge distillation. Yuan et al. (2019) viewed knowledge distillation as a dynamic form of label smoothing, since it delivers a regularization effect during training. Lukasik et al. (2020) further observed that label smoothing can help mitigate label noise, showing that when distilling from noisy data, a teacher trained with label smoothing is helpful.
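Since the abstract discusses label smoothing and knowledge distillation only in the abstract, a minimal sketch of the standard formulations of the two losses may help ground the discussion. This is a generic PyTorch illustration, not the authors' implementation; the function names and the epsilon, temperature, and alpha values are assumptions.

```python
# Minimal sketch (not the paper's code): standard label smoothing and
# knowledge-distillation losses, shown only to make the two techniques
# under study concrete. Hyperparameter values are illustrative assumptions.
import torch
import torch.nn.functional as F

def label_smoothing_ce(logits, targets, epsilon=0.1):
    """Cross-entropy against smoothed targets: the true class gets
    probability 1 - epsilon, the remaining mass is spread uniformly."""
    log_probs = F.log_softmax(logits, dim=-1)
    nll = -log_probs.gather(dim=-1, index=targets.unsqueeze(-1)).squeeze(-1)
    uniform = -log_probs.mean(dim=-1)  # uniform (smoothing) component
    return ((1.0 - epsilon) * nll + epsilon * uniform).mean()

def distillation_kl(student_logits, teacher_logits, temperature=4.0):
    """KL divergence between temperature-softened teacher and student
    distributions, scaled by T^2 as in Hinton et al. (2015)."""
    t = temperature
    student_log_p = F.log_softmax(student_logits / t, dim=-1)
    teacher_p = F.softmax(teacher_logits / t, dim=-1)
    return F.kl_div(student_log_p, teacher_p, reduction="batchmean") * (t * t)

if __name__ == "__main__":
    # Toy usage: combine a hard-label term and a distillation term for the
    # student; alpha balances the two and is an assumed value here.
    student_logits = torch.randn(8, 10)
    teacher_logits = torch.randn(8, 10)
    targets = torch.randint(0, 10, (8,))
    alpha = 0.9
    loss = (1 - alpha) * label_smoothing_ce(student_logits, targets) \
           + alpha * distillation_kl(student_logits, teacher_logits)
    print(loss.item())
```

The incompatibility question studied in the paper concerns the interaction of these two terms: whether training the teacher with the smoothed cross-entropy above degrades the signal the student receives through the distillation term.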
Apr-1-2021