
FEED: Feature-level Ensemble for Knowledge Distillation

arXiv.org Artificial Intelligence

Knowledge Distillation (KD) aims to transfer knowledge in a teacher-student framework by providing the predictions of the teacher network to the student network during training, helping the student network generalize better. It can use either a single teacher with high capacity or an ensemble of multiple teachers. However, the latter is not convenient when one wants to use feature-map-based distillation methods. As a solution, this paper proposes a versatile and powerful training algorithm named FEature-level Ensemble for knowledge Distillation (FEED), which aims to transfer ensemble knowledge using multiple teacher networks. We introduce a couple of training algorithms that transfer ensemble knowledge to the student at the feature-map level. Among feature-map-based distillation methods, using several non-linear transformations in parallel to transfer the knowledge of the multiple teachers helps the student find more generalized solutions. We name this method parallel FEED, and experimental results on CIFAR-100 and ImageNet show that our method yields clear performance enhancements without introducing any additional parameters or computations at test time. We also report experimental results of sequentially feeding the teacher's information to the student, hence the name sequential FEED, and discuss the lessons obtained. Additionally, empirical results on measuring the reconstruction errors at the feature-map level give hints about the source of these enhancements.
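A minimal way to picture the parallel variant: the student's feature map is passed through one small non-linear transformation per teacher, and each transformed map is matched against the corresponding teacher's feature map. The PyTorch sketch below illustrates this idea under assumptions of my own (a 1x1 convolution with batch normalization and ReLU as the non-linear transformation, an L2 matching loss, and equal weighting across teachers); it is not the authors' exact implementation.

```python
# Minimal sketch of feature-level ensemble distillation in the spirit of
# parallel FEED. Assumptions (not from the paper): a 1x1 conv + BN + ReLU
# transformation per teacher, an L2 feature-matching loss, equal weights,
# and teacher feature maps with the same spatial size as the student's.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParallelFeatureDistiller(nn.Module):
    def __init__(self, student_channels, teacher_channels_list):
        super().__init__()
        # One non-linear transformation per teacher, applied in parallel
        # to the student's last feature map.
        self.transforms = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(student_channels, c_t, kernel_size=1, bias=False),
                nn.BatchNorm2d(c_t),
                nn.ReLU(inplace=True),
            )
            for c_t in teacher_channels_list
        ])

    def forward(self, student_feat, teacher_feats):
        # student_feat:  (N, C_s, H, W) feature map from the student
        # teacher_feats: list of (N, C_t, H, W) feature maps, one per teacher
        loss = 0.0
        for transform, t_feat in zip(self.transforms, teacher_feats):
            loss = loss + F.mse_loss(transform(student_feat), t_feat.detach())
        return loss / len(self.transforms)

# Usage sketch: total_loss = ce_loss + beta * distiller(s_feat, [t1_feat, t2_feat]),
# where beta is a hypothetical weighting hyperparameter.
```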


Self-Knowledge Distillation: A Simple Way for Better Generalization

arXiv.org Machine Learning

The generalization capability of deep neural networks has been substantially improved by applying a wide spectrum of regularization methods, e.g., restricting function space, injecting randomness during training, augmenting data, etc. In this work, we propose a simple yet effective regularization method named self-knowledge distillation (Self-KD), which progressively distills a model's own knowledge to soften hard targets (i.e., one-hot vectors) during training. Hence, it can be interpreted within a knowledge distillation framework in which a student becomes a teacher itself. The proposed method is applicable to any supervised learning task with hard targets and can be easily combined with existing regularization methods to further enhance generalization performance. Furthermore, we show that Self-KD not only achieves better accuracy but also provides high-quality confidence estimates. Extensive experimental results on three different tasks, image classification, object detection, and machine translation, demonstrate that our method consistently improves the performance of state-of-the-art baselines; in particular, it achieves state-of-the-art BLEU scores of 30.0 and 36.2 on the IWSLT15 English-to-German and German-to-English tasks, respectively.
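A rough sketch of the "student becomes its own teacher" idea is given below: the one-hot target is softened by mixing in the model's own earlier predictions. The use of a previous snapshot's logits, the fixed mixing weight, and the temperature are illustrative assumptions rather than the paper's exact formulation.

```python
# Minimal sketch of self-knowledge distillation: mix the hard one-hot target
# with the model's own earlier predictions to form a softened target.
# The previous-snapshot logits, fixed mixing weight alpha, and temperature
# are illustrative assumptions, not the paper's exact schedule.
import torch
import torch.nn.functional as F

def self_kd_loss(logits, prev_logits, targets, alpha=0.3, temperature=2.0):
    """logits: current model outputs (N, C); prev_logits: outputs of the same
    model saved earlier in training (N, C); targets: class indices (N,)."""
    num_classes = logits.size(1)
    one_hot = F.one_hot(targets, num_classes).float()
    # Soften the hard target with the model's own past predictions.
    past_probs = F.softmax(prev_logits.detach() / temperature, dim=1)
    soft_target = (1.0 - alpha) * one_hot + alpha * past_probs
    log_probs = F.log_softmax(logits, dim=1)
    # Cross-entropy against the softened target distribution.
    return -(soft_target * log_probs).sum(dim=1).mean()
```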


MEAL V2: Boosting Vanilla ResNet-50 to 80%+ Top-1 Accuracy on ImageNet without Tricks

arXiv.org Artificial Intelligence

In this paper, we introduce a simple yet effective approach that can boost the vanilla ResNet-50 to 80%+ Top-1 accuracy on ImageNet without any tricks. Our method builds on the recently proposed MEAL, i.e., ensemble knowledge distillation via discriminators. We further simplify it by 1) adopting the similarity loss and discriminator only on the final outputs and 2) using the average of softmax probabilities from all teacher ensembles as the stronger supervision for distillation. One crucial perspective of our method is that the one-hot/hard label should not be used in the distillation process. We show that such a simple framework can achieve state-of-the-art results without involving any commonly used techniques, such as 1) architecture modification; 2) outside training data beyond ImageNet; 3) autoaug/randaug; 4) cosine learning rate; 5) mixup/cutmix training; 6) label smoothing; etc. On ImageNet, our method obtains 80.67% top-1 accuracy using a single crop size of 224×224 on the vanilla ResNet-50, outperforming the previous state of the art by a remarkable margin under the same network structure. Our result can be regarded as a new strong baseline on ResNet-50 using knowledge distillation. To the best of our knowledge, this is the first work that boosts vanilla ResNet-50 to surpass 80% on ImageNet without architecture modification or additional training data. Our code and models are available at: https://github.com/szq0214/MEAL-V2.
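A bare-bones version of the distillation objective described above, soft labels only with no one-hot term, might look like the following sketch; the KL-divergence formulation and the omission of MEAL's discriminator are simplifications on my part.

```python
# Minimal sketch of distilling from an averaged teacher ensemble with soft
# labels only (no one-hot/hard-label term). The KL-divergence loss and the
# omission of MEAL's discriminator are simplifications for illustration.
import torch
import torch.nn.functional as F

def ensemble_soft_label_loss(student_logits, teacher_logits_list):
    # Average the softmax probabilities of all teachers to form the target.
    teacher_probs = torch.stack(
        [F.softmax(t.detach(), dim=1) for t in teacher_logits_list]
    ).mean(dim=0)
    # Match the student's predicted distribution to the averaged teacher distribution.
    return F.kl_div(
        F.log_softmax(student_logits, dim=1), teacher_probs, reduction="batchmean"
    )
```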


Extracurricular Learning: Knowledge Transfer Beyond Empirical Distribution

arXiv.org Machine Learning

Knowledge distillation has been used to transfer knowledge learned by a sophisticated model (teacher) to a simpler model (student). This technique is widely used to compress model complexity. However, in most applications the compressed student model suffers from an accuracy gap with its teacher. We propose extracurricular learning, a novel knowledge distillation method, that bridges this gap by (1) modeling student and teacher output distributions; (2) sampling examples from an approximation to the underlying data distribution; and (3) matching student and teacher output distributions over this extended set, including uncertain samples. We conduct rigorous evaluations on regression and classification tasks and show that, compared to standard knowledge distillation, extracurricular learning reduces the gap by 46% to 68%.

For example, both the PyramidNet-110 model [23] and the larger PyramidNet-200 model achieve perfect accuracy on the CIFAR100 [32] training set, while the latter has 3% higher generalization accuracy. This motivated transferring the "knowledge" encoded in the more accurate larger model to the smaller one. Knowledge Distillation [8, 27] (KD) established an important mechanism through which one model (typically of higher capacity, called teacher) can train another model (typically a smaller model that satisfies the computational budget, called student). KD has been implemented in many machine learning tasks, for example image classification [27], object detection [12, 65], video labeling [74], natural language processing [60, 41, 57, 36, 61], and speech recognition [11, 59, 37]. The idea of KD is to encourage the student to imitate the teacher's behavior over a set of data points, called the transfer set.
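The "extended transfer set" idea can be sketched roughly as follows: generate extra inputs from an approximation to the data distribution and match the student to the teacher on both real and synthetic points. The interpolation-plus-noise sampler and the KL matching loss below are illustrative assumptions, not the paper's actual density model or objective.

```python
# Rough sketch of matching student and teacher over an extended transfer set:
# create extra inputs from an approximation to the data distribution and match
# output distributions on both real and synthetic points. The sampler and the
# KL matching loss are illustrative assumptions, not the paper's method.
import torch
import torch.nn.functional as F

def sample_near_data(x, noise_scale=0.1):
    # Crude stand-in for sampling from an approximate data distribution:
    # interpolate between shuffled pairs of inputs and add small noise.
    perm = torch.randperm(x.size(0), device=x.device)
    lam = torch.rand(x.size(0), *([1] * (x.dim() - 1)), device=x.device)
    return lam * x + (1 - lam) * x[perm] + noise_scale * torch.randn_like(x)

def extended_matching_loss(student, teacher, x):
    x_ext = torch.cat([x, sample_near_data(x)], dim=0)  # real + synthetic points
    with torch.no_grad():
        teacher_probs = F.softmax(teacher(x_ext), dim=1)
    student_log_probs = F.log_softmax(student(x_ext), dim=1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
```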


An Overview of Neural Network Compression

arXiv.org Machine Learning

Overparameterized networks trained to convergence have shown impressive performance in domains such as computer vision and natural language processing. Pushing the state of the art on salient tasks within these domains has meant making these models larger and more difficult for machine learning practitioners to use, given the increasing memory and storage requirements, not to mention the larger carbon footprint. Thus, in recent years there has been a resurgence in model compression techniques, particularly for deep convolutional neural networks and self-attention-based networks such as the Transformer. Hence, this paper provides a timely overview of both older and current compression techniques for deep neural networks, including pruning, quantization, tensor decomposition, knowledge distillation, and combinations thereof. We assume a basic familiarity with deep learning architectures (for an introduction to deep learning, see Goodfellow et al., 2016), namely Recurrent Neural Networks (RNNs; Rumelhart et al., 1985; Hochreiter and Schmidhuber, 1997), Convolutional Neural Networks (Fukushima, 1980; for an up-to-date overview, see Khan et al., 2019), and self-attention-based networks (Vaswani et al., 2017; for a general overview of self-attention networks, see Chaudhari et al., 2019, and for their use in natural language processing, see Hu, 2019). Most of the papers discussed are proposed in the context of at least one of these DNN architectures.
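As one small concrete illustration of the surveyed techniques, the sketch below applies layer-wise unstructured magnitude pruning, zeroing each weight tensor's smallest entries; this is a generic textbook variant, not any specific method from the survey.

```python
# Minimal sketch of one family of surveyed techniques: layer-wise unstructured
# magnitude pruning. Within each weight tensor, entries whose absolute value
# falls below the per-layer sparsity quantile are zeroed out. This is a generic
# textbook variant, not a specific paper's method.
import torch
import torch.nn as nn

def magnitude_prune(model: nn.Module, sparsity: float = 0.5) -> None:
    with torch.no_grad():
        for name, param in model.named_parameters():
            # Skip biases and 1-D normalization parameters; prune weight matrices/kernels.
            if "weight" in name and param.dim() > 1:
                threshold = param.abs().flatten().quantile(sparsity)
                param.mul_((param.abs() > threshold).float())  # zero the smallest weights

# Usage sketch: magnitude_prune(my_model, sparsity=0.8) zeroes roughly the
# smallest 80% of each pruned layer's weights in place.
```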