Education
Asymmetric Temperature Scaling Makes Larger Networks Teach Well Again
Knowledge Distillation (KD) aims at transferring the knowledge of a wellperformed neural network (the teacher) to a weaker one (the student). A peculiar phenomenon is that a more accurate model doesn't necessarily teach better, and temperature adjustment can neither alleviate the mismatched capacity. To explain this, we decompose the efficacy of KD into three parts: correct guidance, smooth regularization, and class discriminability. The last term describes the distinctness of wrong class probabilities that the teacher provides in KD. Complex teachers tend to be over-confident and traditional temperature scaling limits the efficacy of class discriminability, resulting in less discriminative wrong class probabilities. Therefore, we propose Asymmetric Temperature Scaling (ATS), which separately applies a higher/lower temperature to the correct/wrong class. ATS enlarges the variance of wrong class probabilities in the teacher's label and makes the students grasp the absolute affinities of wrong classes to the target class as discriminative as possible. Both theoretical analysis and extensive experimental results demonstrate the effectiveness of ATS.
Asymmetric Temperature Scaling Makes Larger Networks Teach Well Again
Knowledge Distillation (KD) aims at transferring the knowledge of a wellperformed neural network (the teacher) to a weaker one (the student). A peculiar phenomenon is that a more accurate model doesn't necessarily teach better, and temperature adjustment can neither alleviate the mismatched capacity. To explain this, we decompose the efficacy of KD into three parts: correct guidance, smooth regularization, and class discriminability. The last term describes the distinctness of wrong class probabilities that the teacher provides in KD. Complex teachers tend to be over-confident and traditional temperature scaling limits the efficacy of class discriminability, resulting in less discriminative wrong class probabilities. Therefore, we propose Asymmetric Temperature Scaling (ATS), which separately applies a higher/lower temperature to the correct/wrong class. ATS enlarges the variance of wrong class probabilities in the teacher's label and makes the students grasp the absolute affinities of wrong classes to the target class as discriminative as possible. Both theoretical analysis and extensive experimental results demonstrate the effectiveness of ATS.
ACloser Look at Learned Optimization: Stability, Robustness, and Inductive Biases
Learned optimizers--neural networks that are trained to act as optimizers--have the potential to dramatically accelerate training of machine learning models. However, even when meta-trained across thousands of tasks at huge computational expense, blackbox learned optimizers often struggle with stability and generalization when applied to tasks unlike those in their meta-training set. In this paper, we use tools from dynamical systems to investigate the inductive biases and stability properties of optimization algorithms, and apply the resulting insights to designing inductive biases for blackbox optimizers. Our investigation begins with a noisy quadratic model, where we characterize conditions in which optimization is stable, in terms of eigenvalues of the training dynamics. We then introduce simple modifications to a learned optimizer's architecture and meta-training procedure which lead to improved stability, and improve the optimizer's inductive bias. We apply the resulting learned optimizer to a variety of neural network training tasks, where it outperforms the current state of the art learned optimizer--at matched optimizer computational overhead--with regard to optimization performance and meta-training speed, and is capable of generalization to tasks far different from those it was meta-trained on.
ReSSL: Relational Self-Supervised Learning with Weak Augmentation
Self-supervised Learning (SSL) including the mainstream contrastive learning has achieved great success in learning visual representations without data annotations. However, most of methods mainly focus on the instance level information (i.e., the different augmented images of the same instance should have the same feature or cluster into the same class), but there is a lack of attention on the relationships between different instances. In this paper, we introduced a novel SSL paradigm, which we term as relational self-supervised learning (ReSSL) framework that learns representations by modeling the relationship between different instances. Specifically, our proposed method employs sharpened distribution of pairwise similarities among different instances as relation metric, which is thus utilized to match the feature embeddings of different augmentations. Moreover, to boost the performance, we argue that weak augmentations matter to represent a more reliable relation, and leverage momentum strategy for practical efficiency. Experimental results show that our proposed ReSSL significantly outperforms the previous stateof-the-art algorithms in terms of both performance and training efficiency.
Learning Predictions for Algorithms with Predictions
A burgeoning paradigm in algorithm design is the field of algorithms with predictions, in which algorithms can take advantage of a possibly-imperfect prediction of some aspect of the problem. While much work has focused on using predictions to improve competitive ratios, running times, or other performance measures, less effort has been devoted to the question of how to obtain the predictions themselves, especially in the critical online setting. We introduce a general design approach for algorithms that learn predictors: (1) identify a functional dependence of the performance measure on the prediction quality and (2) apply techniques from online learning to learn predictors, tune robustness-consistency trade-offs, and bound the sample complexity. We demonstrate the effectiveness of our approach by applying it to bipartite matching, ski-rental, page migration, and job scheduling. In several settings we improve upon multiple existing results while utilizing a much simpler analysis, while in the others we provide the first learning-theoretic guarantees.
Preserved central model for faster bidirectional compression in distributed settings
We develop a new approach to tackle communication constraints in a distributed learning problem with a central server. We propose and analyze a new algorithm that performs bidirectional compression and achieves the same convergence rate as algorithms using only uplink (from the local workers to the central server) compression. To obtain this improvement, we design MCM, an algorithm such that the downlink compression only impacts local models, while the global model is preserved. As a result, and contrary to previous works, the gradients on local servers are computed on perturbed models. Consequently, convergence proofs are more challenging and require a precise control of this perturbation. To ensure it, MCMadditionally combines model compression with a memory mechanism. This analysis opens new doors, e.g.