Highlight Every Step: Knowledge Distillation via Collaborative Teaching
Haoran Zhao, Xin Sun, Junyu Dong, Changrui Chen, Zihe Dong
–arXiv.org Artificial Intelligence
High storage and computational costs hinder the deployment of deep neural networks on resource-constrained devices. Knowledge distillation aims to train a compact student network by transferring knowledge from a larger pre-trained teacher model. However, most existing knowledge distillation methods ignore the valuable information in the training process that leads to the final training results. In this paper, we propose a new Collaborative Teaching Knowledge Distillation (CTKD) strategy which employs two special teachers. Specifically, one teacher trained from scratch (i.e., the scratch teacher) assists the student step by step using its temporary outputs. It forces the student to follow the optimal path towards the final logits with high accuracy. The other, pre-trained teacher (i.e., the expert teacher) guides the student to focus on a critical region which is more useful for the task. Combining the knowledge from the two teachers can significantly improve the performance of the student network in knowledge distillation. Experimental results on the CIFAR-10, CIFAR-100, SVHN and Tiny ImageNet datasets verify that the proposed knowledge distillation method is efficient and achieves state-of-the-art performance.
Recently, deep neural networks have achieved superior performance in a variety of applications such as computer vision [1][2][3][4] and natural language processing [5][6]. However, along with higher performance, deep neural network architectures have become much deeper and wider, which incurs a high computation and memory cost at inference. Deploying these models on edge-computing systems such as embedded devices and mobile phones is a great burden. Therefore, many methods [7][8][9][10][11] have been proposed to reduce the computational complexity and storage requirements of deep neural networks.
Some lightweight networks such as Inception [12], MobileNet [13], ShuffleNet [14], SqueezeNet [15] and CondenseNet [16] have been proposed to reduce the network size as much as possible while keeping a high recognition accuracy. All of the above-mentioned methods focus on physically reducing the internal redundancy of the model to obtain a shallow and thin architecture.
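The abstract describes the two teachers only in words, so the following is a minimal sketch of how such a combined objective might look, not the authors' formulation. It assumes that the scratch teacher's guidance is a temperature-softened KL term on its current (temporary) logits and that the expert teacher's "critical region" guidance is an attention-map matching term in the style of attention transfer. The function names, the weights `alpha` and `beta`, and the temperature value are all hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Standard cross-entropy against the ground-truth class index."""
    return -math.log(softmax(logits)[label])

def kd_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on softened outputs, scaled by T^2.
    Assumption: teacher_logits are the scratch teacher's outputs
    at the current training step."""
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s) if pt > 0)
    return temperature ** 2 * kl

def attention_map(features):
    """features: list of channels, each a flat list of spatial activations.
    Returns the L2-normalized sum of squared activations across channels."""
    n = len(features[0])
    att = [sum(ch[i] ** 2 for ch in features) for i in range(n)]
    norm = math.sqrt(sum(a * a for a in att)) or 1.0
    return [a / norm for a in att]

def attention_loss(student_features, expert_features):
    """Squared distance between student and expert attention maps.
    Assumption: this stands in for the expert teacher's guidance."""
    a_s = attention_map(student_features)
    a_e = attention_map(expert_features)
    return sum((x - y) ** 2 for x, y in zip(a_s, a_e))

def collaborative_loss(student_logits, label, scratch_logits,
                       student_features, expert_features,
                       alpha=0.5, beta=1.0, temperature=4.0):
    """Hypothetical combined objective: task loss + scratch-teacher KD term
    + expert-teacher attention term. alpha and beta are illustrative."""
    return (cross_entropy(student_logits, label)
            + alpha * kd_loss(student_logits, scratch_logits, temperature)
            + beta * attention_loss(student_features, expert_features))

# Toy usage: 3 classes, 2 channels with 4 spatial positions each.
loss = collaborative_loss(
    student_logits=[0.2, 1.1, -0.3], label=1,
    scratch_logits=[0.1, 1.5, -0.2],
    student_features=[[0.1, 0.4, 0.2, 0.0], [0.3, 0.5, 0.1, 0.2]],
    expert_features=[[0.0, 0.9, 0.1, 0.0], [0.1, 0.8, 0.0, 0.1]],
)
print(loss)
```

In a training loop under these assumptions, the scratch teacher would be updated alongside the student so that `scratch_logits` change every step, while the expert teacher stays frozen and only supplies `expert_features`.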
Oct-1-2025