Collaborative Distillation for Top-N Recommendation
Jae-woong Lee, Minjin Choi, Jongwuk Lee, Hyunjung Shim
Abstract: Knowledge distillation (KD) is a well-known method for reducing inference latency by compressing a cumbersome teacher model into a small student model. Despite the success of KD in classification tasks, applying KD to recommender models is challenging because of the sparsity of positive feedback, the ambiguity of missing feedback, and the ranking problem inherent in top-N recommendation. To address these issues, we propose a new KD model for the collaborative filtering approach, namely collaborative distillation (CD). Specifically, (1) we reformulate a loss function to deal with the ambiguity of missing feedback. Through experimental results, we demonstrate that the proposed model outperforms the state-of-the-art method by 2.7-33.2%. Moreover, the proposed model achieves performance comparable to that of the teacher model.

Neural recommender models [1]-[9] have achieved better performance than conventional latent factor models, either by capturing nonlinear and complex correlation patterns among users/items or by leveraging hidden features extracted from auxiliary information such as text and images. However, neural models have one or more orders of magnitude more parameters than conventional models, which implies a trade-off between accuracy and efficiency. As a result, neural recommender models usually suffer from higher latency during the inference phase.

Our primary goal is to develop a recommender model that strikes a balance between effectiveness and efficiency. In this paper, we employ knowledge distillation (KD) [10], a model compression technique that transfers the distilled knowledge of a large model (a.k.a. the teacher model) to a small model (a.k.a. the student model). Because the student model can exploit the knowledge transferred from the teacher model, it naturally enjoys low computational cost and memory usage, and is thus capable of balancing effectiveness and efficiency. Specifically, the training procedure for KD consists of two steps. In the offline training phase, the teacher model is supervised by a training dataset with labels. The student model is then trained with both the ground-truth labels and the soft predictions produced by the pre-trained teacher.
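To make the two-step procedure concrete, the following is a minimal sketch (in PyTorch, not from the paper) of a generic distillation objective for implicit-feedback top-N recommendation: the student is fit to the observed binary feedback while also matching the teacher's softened item scores. The names kd_loss, student_logits, teacher_logits, temperature, and lambda_kd are illustrative assumptions, and the loss shown is the standard soft-target formulation rather than the CD loss proposed in this paper.

# Minimal sketch of a generic KD objective for implicit-feedback
# top-N recommendation (illustrative; not the paper's CD loss).
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, temperature=2.0, lambda_kd=0.5):
    """Combine the hard-label loss with a soft-target distillation term.

    student_logits, teacher_logits: (batch, num_items) raw item scores.
    labels: (batch, num_items) binary implicit feedback (1 = observed).
    """
    # Hard-label term: binary cross-entropy against observed/missing feedback.
    hard = F.binary_cross_entropy_with_logits(student_logits, labels)

    # Soft-target term: match the teacher's temperature-softened item probabilities.
    soft_targets = torch.sigmoid(teacher_logits / temperature)
    soft = F.binary_cross_entropy_with_logits(student_logits / temperature, soft_targets)

    return (1.0 - lambda_kd) * hard + lambda_kd * soft

Here lambda_kd balances the hard-label and soft-target terms, and teacher_logits would come from the already-trained, frozen teacher (e.g., computed under torch.no_grad()), so only the student receives gradient updates.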
Nov-12-2019