 self-adaptive training



A Experimental Setups A.1 Double descent phenomenon Following previous work [

Neural Information Processing Systems

Accuracy curves of a model trained using ERM. Figure 7: Accuracy curves of a model trained on the noisy CIFAR10 training set with an 80% noise rate. For training, we use an initial learning rate of 0.1, a batch size of 128, and 100 training epochs. We split the training set into two portions: 1) Untouched portion, i.e., the elements in the training set which were left untouched; 2) Corrupted portion, i.e., the elements in ... The learning rate is linearly increased from 0.0003 ... Following common practice, we use random resizing, cropping and flipping augmentation during training. However, they only analyzed the generalization errors in the presence of corrupted labels. This occurs around the epochs between underfitting and overfitting.



Review for NeurIPS paper: Self-Adaptive Training: beyond Empirical Risk Minimization

Neural Information Processing Systems

Weaknesses: The main weakness of the proposed approach is that it is not supported by any theoretical insight. In particular, the success of the method hinges on the premise that the model is able to guess the right predictions so as to correct the noisy labels. Since there is no theoretical criterion to verify that premise, it is not possible to predict whether the proposed method will work well on new learning tasks. Going further, one can imagine cases where this method would fail and actually perform worse than ERM. For instance, if the model is unable to capture sufficient information from the data distribution (because the data distribution is very complex, there are too few training samples, or the model does not have sufficient capacity), it would be impossible for the model to "bootstrap" its own predictions and guess the correct labels.


Review for NeurIPS paper: Self-Adaptive Training: beyond Empirical Risk Minimization

Neural Information Processing Systems

The paper focuses on the problem of learning from corrupted data (e.g. ...). This objective can be interpreted as a form of self-training whereby the model's predictions are progressively averaged with the true (and possibly noisy) labels, coupled with a sample weighting scheme which improves training stability. The authors show that this approach can be used for a variety of vision tasks, including classification under label noise, adversarial training, and selective classification. The reviewers appreciated the conceptual simplicity of the method, the clarity of the presentation, and the promising empirical results. The discussion phase focused on the following two drawbacks: - Theoretical justification: While the theoretical analysis is hard for the general case, it might be doable in the corrupted linear regression case, which could offer some valuable insights.
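The objective described in this summary (predictions progressively averaged with the given labels, plus a sample weighting scheme) can be illustrated with a minimal sketch. The snippet does not spell out the exact rule, so the following are assumptions made for illustration only: the averaging is an exponential moving average with a hypothetical momentum `alpha`, and each sample's weight is the largest entry of its current soft target. The function names are hypothetical as well.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def update_soft_targets(targets, probs, alpha=0.9):
    """Assumed EMA rule: move each soft target toward the model prediction."""
    return [
        [alpha * t + (1.0 - alpha) * p for t, p in zip(t_row, p_row)]
        for t_row, p_row in zip(targets, probs)
    ]

def weighted_soft_cross_entropy(targets, probs, eps=1e-12):
    """Cross-entropy against soft targets, weighted per sample.

    Assumption: the weight of a sample is the maximum entry of its soft
    target, so samples with a confident target contribute more.
    """
    weights = [max(t_row) for t_row in targets]
    losses = [
        -sum(t * math.log(p + eps) for t, p in zip(t_row, p_row))
        for t_row, p_row in zip(targets, probs)
    ]
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)

# Soft targets start as the one-hot (possibly noisy) labels.
targets = [[1.0, 0.0], [0.0, 1.0]]
probs = [softmax([0.0, 0.0]), [0.2, 0.8]]
loss = weighted_soft_cross_entropy(targets, probs)
targets = update_soft_targets(targets, probs, alpha=0.9)
```

In a real training loop, the target update would run once per step or epoch on the samples in the batch. A mislabeled sample whose prediction consistently disagrees with its label would then see its target drift toward the prediction.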


Self-Adaptive Training: beyond Empirical Risk Minimization

Neural Information Processing Systems

We propose self-adaptive training---a new training algorithm that dynamically calibrates the training process by model predictions without incurring extra computational cost---to improve generalization of deep learning for potentially corrupted training data. This problem is important for robustly learning from data that are corrupted by, e.g., random noise and adversarial examples. The standard empirical risk minimization (ERM) for such data, however, may easily overfit the noise and thus suffer from sub-optimal performance. In this paper, we observe that model predictions can substantially benefit the training process: self-adaptive training significantly mitigates the overfitting issue and improves generalization over ERM under both random and adversarial noise. Besides, in sharp contrast to the recently-discovered double-descent phenomenon in ERM, self-adaptive training exhibits a single-descent error-capacity curve, indicating that such a phenomenon might be a result of overfitting of noise.


Training Private Models That Know What They Don't Know

arXiv.org Artificial Intelligence

Training reliable deep learning models which avoid making overconfident but incorrect predictions is a longstanding challenge. This challenge is further exacerbated when learning has to be differentially private: protection provided to sensitive data comes at the price of injecting additional randomness into the learning process. In this work, we conduct a thorough empirical investigation of selective classifiers -- that can abstain when they are unsure -- under a differential privacy constraint. We find that several popular selective prediction approaches are ineffective in a differentially private setting as they increase the risk of privacy leakage. At the same time, we identify that a recent approach that only uses checkpoints produced by an off-the-shelf private learning algorithm stands out as particularly suitable under DP. Further, we show that differential privacy does not just harm utility but also degrades selective classification performance. To analyze this effect across privacy levels, we propose a novel evaluation mechanism which isolates selective prediction performance across model utility levels. Our experimental results show that recovering the performance level attainable by non-private models is possible but comes at a considerable coverage cost as the privacy budget decreases.


Towards Better Selective Classification

arXiv.org Artificial Intelligence

We tackle the problem of Selective Classification where the objective is to achieve the best performance on a predetermined ratio (coverage) of the dataset. Recent state-of-the-art selective methods come with architectural changes either via introducing a separate selection head or an extra abstention logit. In this paper, we challenge the aforementioned methods. The results suggest that the superior performance of state-of-the-art methods is owed to training a more generalizable classifier rather than their proposed selection mechanisms. We argue that the best performing selection mechanism should instead be rooted in the classifier itself. Our proposed selection strategy uses the classification scores and achieves better results by a significant margin, consistently, across all coverages and all datasets, without any added compute cost. Furthermore, inspired by semi-supervised learning, we propose an entropy-based regularizer that improves the performance of selective classification methods. Our proposed selection mechanism with the proposed entropy-based regularizer achieves new state-of-the-art results. A model's ability to abstain from a decision when lacking confidence is essential in mission-critical applications. This is known as the Selective Prediction problem setting. The abstained and uncertain samples can be flagged and passed to a human expert for manual assessment, which, in turn, can improve the re-training process. This is crucial in problem settings where confidence is critical or an incorrect prediction can have significant consequences such as in the financial, medical, or autonomous driving domains. Several papers have tried to address this problem by estimating the uncertainty in the prediction.
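To make the coverage-based setting above concrete, here is a minimal sketch of selection driven by the classifier's own scores. It assumes the confidence of a sample is its maximum softmax probability, which is one common choice and not necessarily the exact score this paper uses. The `entropy` helper shows the quantity an entropy-based regularizer could act on; the precise form of the regularizer is not given in the snippet. All function names are hypothetical.

```python
import math

def confidence(probs):
    """Assumed confidence score: the largest class probability."""
    return max(probs)

def entropy(probs, eps=1e-12):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p + eps) for p in probs)

def select_by_coverage(all_probs, coverage):
    """Indices of the most confident samples, keeping a `coverage` fraction."""
    n_keep = max(1, int(round(coverage * len(all_probs))))
    ranked = sorted(
        range(len(all_probs)),
        key=lambda i: confidence(all_probs[i]),
        reverse=True,
    )
    return sorted(ranked[:n_keep])

def selective_accuracy(all_probs, labels, coverage):
    """Accuracy on the accepted samples only; the rest are abstained on."""
    kept = select_by_coverage(all_probs, coverage)
    correct = 0
    for i in kept:
        predicted = max(range(len(all_probs[i])), key=all_probs[i].__getitem__)
        correct += int(predicted == labels[i])
    return correct / len(kept)

all_probs = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.55, 0.45]]
labels = [0, 1, 1, 1]
print(selective_accuracy(all_probs, labels, coverage=0.5))  # accuracy on the 2 most confident samples
print(selective_accuracy(all_probs, labels, coverage=1.0))  # accuracy with no abstention
```

In this toy data the two least confident predictions are the wrong ones, so abstaining on them raises accuracy on the accepted half. This is the behavior a good selection mechanism aims for.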


Self-Adaptive Training: Bridging Supervised and Self-Supervised Learning

arXiv.org Artificial Intelligence

We propose self-adaptive training -- a unified training algorithm that dynamically calibrates and enhances training processes by model predictions without incurring an extra computational cost -- to advance both supervised and self-supervised learning of deep neural networks. We analyze the training dynamics of deep networks on training data that are corrupted by, e.g., random noise and adversarial examples. Our analysis shows that model predictions are able to magnify useful underlying information in data and this phenomenon occurs broadly even in the absence of any label information, highlighting that model predictions could substantially benefit the training processes: self-adaptive training improves the generalization of deep networks under noise and enhances the self-supervised representation learning. The analysis also sheds light on understanding deep learning, e.g., a potential explanation of the recently-discovered double-descent phenomenon in empirical risk minimization and the collapsing issue of the state-of-the-art self-supervised learning algorithms. Experiments on the CIFAR, STL, and ImageNet datasets verify the effectiveness of our approach in three applications: classification with label noise, selective classification, and linear evaluation. To facilitate future research, the code has been made publicly available at https://github.com/LayneH/self-adaptive-training.


Generalization by Recognizing Confusion

arXiv.org Machine Learning

A recently-proposed technique called self-adaptive training augments modern neural networks by allowing them to adjust training labels on the fly, to avoid overfitting to samples that may be mislabeled or otherwise non-representative. By combining the self-adaptive objective with mixup, we further improve the accuracy of self-adaptive models for image recognition; the resulting classifier obtains state-of-the-art accuracies on datasets corrupted with label noise. Robustness to label noise implies a lower generalization gap; thus, our approach also leads to improved generalizability. We find evidence that the Rademacher complexity of these algorithms is low, suggesting a new path towards provable generalization for this type of deep learning model. Last, we highlight a novel connection between difficulties accounting for rare classes and robustness under noise, as rare classes are in a sense indistinguishable from label noise. Our code can be found at https://github.com/Tuxianeer/generalizationconfusion.
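As a rough illustration of how mixup could be combined with adjustable training labels, the sketch below mixes two inputs and their soft targets with a single coefficient. It assumes the standard mixup recipe (a coefficient drawn from a symmetric Beta distribution) applied to soft targets instead of one-hot labels. How the paper actually couples the two objectives is not stated in the abstract, and the names `mixup_pair`, `sample_lambda`, and `beta_alpha` are hypothetical.

```python
import random

def sample_lambda(beta_alpha=1.0, rng=random):
    """Draw a mixing coefficient in [0, 1] from Beta(beta_alpha, beta_alpha)."""
    return rng.betavariate(beta_alpha, beta_alpha)

def mixup_pair(x_a, t_a, x_b, t_b, lam):
    """Convex combination of two inputs and of their (soft) targets."""
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    t = [lam * a + (1.0 - lam) * b for a, b in zip(t_a, t_b)]
    return x, t

# Two toy samples with flattened features and soft targets over two classes.
x_a, t_a = [1.0, 0.0], [0.95, 0.05]
x_b, t_b = [0.0, 1.0], [0.10, 0.90]
lam = sample_lambda(beta_alpha=1.0)
x_mix, t_mix = mixup_pair(x_a, t_a, x_b, t_b, lam)
```

Because both targets are probability vectors and the combination is convex, the mixed target still sums to one and can be used directly in a soft-target cross-entropy loss.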