Dual-Head Knowledge Distillation: Enhancing Logits Utilization with an Auxiliary Head
Penghui Yang, Chen-Chen Zong, Sheng-Jun Huang, Lei Feng, Bo An
– arXiv.org Artificial Intelligence
Traditional knowledge distillation focuses on aligning the student's predicted probabilities with both ground-truth labels and the teacher's predicted probabilities. However, the transition from logits to predicted probabilities can obscure certain indispensable information. To address this issue, it is intuitive to additionally introduce a logit-level loss function as a supplement to the widely used probability-level loss function, so as to exploit the latent information in the logits. Unfortunately, we empirically find that combining the newly introduced logit-level loss with the previous probability-level loss leads to performance degeneration, even trailing behind the performance of employing either loss in isolation. We attribute this phenomenon to the collapse of the classification head, which is verified by our theoretical analysis based on neural collapse theory. Drawing from this analysis, we propose a novel method called dual-head knowledge distillation, which partitions the linear classifier into two classification heads responsible for different losses, thereby preserving the beneficial effects of both losses on the backbone while eliminating their adverse influences on the classification head. Extensive experiments validate that our method can effectively exploit the information inside the logits and achieve superior performance against state-of-the-art counterparts.

Despite the remarkable success of deep neural networks (DNNs) in various fields, it remains a significant challenge to deploy these large models on lightweight terminals (e.g., mobile phones), particularly under constraints on computational resources or requirements for short inference time. To mitigate this problem, knowledge distillation (KD) (Hinton et al., 2015) has been widely investigated; it aims to improve the performance of a small network (a.k.a. the "student") by leveraging the expansive knowledge of a large network (a.k.a. the "teacher") to guide the training of the student network. Traditional KD techniques focus on minimizing the disparity between the predicted probabilities of the teacher and the student, which are typically the outputs of the softmax function. Nevertheless, the transformation from logits to predicted probabilities via the softmax function may lose some underlying information.
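The following is a minimal PyTorch sketch of the dual-head idea described above: a shared backbone feeds two separate linear heads, where head A is trained with the usual probability-level losses (cross-entropy plus softened KL to the teacher) and head B with a logit-level loss, so that both signals shape the backbone without conflicting on a single classifier. The choice of MSE as the logit-level loss, the temperature and weighting values, and the names `DualHeadStudent` and `dual_head_kd_loss` are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualHeadStudent(nn.Module):
    """Student with a shared backbone and two classification heads (sketch)."""

    def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: int):
        super().__init__()
        self.backbone = backbone                          # e.g., a small CNN minus its final FC layer
        self.head_a = nn.Linear(feat_dim, num_classes)    # head for probability-level losses
        self.head_b = nn.Linear(feat_dim, num_classes)    # head for the logit-level loss

    def forward(self, x):
        feats = self.backbone(x)
        return self.head_a(feats), self.head_b(feats)

def dual_head_kd_loss(logits_a, logits_b, teacher_logits, targets,
                      T=4.0, alpha=1.0, beta=1.0):
    """Assign each loss to its own head; exact losses/weights are assumptions."""
    # Probability-level supervision on head A: cross-entropy + temperature-scaled KL.
    ce = F.cross_entropy(logits_a, targets)
    kd = F.kl_div(F.log_softmax(logits_a / T, dim=1),
                  F.softmax(teacher_logits / T, dim=1),
                  reduction="batchmean") * (T * T)
    # Logit-level supervision on head B: plain MSE between raw logits (illustrative choice).
    logit_loss = F.mse_loss(logits_b, teacher_logits)
    # Both terms backpropagate into the shared backbone, but each head sees only its own loss.
    return ce + alpha * kd + beta * logit_loss
```

At inference, only one of the two heads would be kept, so the deployed student has the same parameter count as an ordinary single-head model; which head is retained, like the specific losses above, is an assumption of this sketch rather than a detail stated in the abstract.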
Nov-13-2024
- Genre:
- Research Report
- New Finding (0.46)
- Promising Solution (0.34)
- Industry:
- Education (0.31)