overparameterized network
The Effect of Label Noise on the Information Content of Neural Representations
Umar, Ali Hussaini, Tezoh, Franky Kevin Nando, Barbier, Jean, Acevedo, Santiago, Laio, Alessandro
In supervised classification tasks, models are trained to predict a label for each data point. In real-world datasets, these labels are often noisy due to annotation errors. While the impact of label noise on the performance of deep learning models has been widely studied, its effects on the networks' hidden representations remain poorly understood. We address this gap by systematically comparing hidden representations using the Information Imbalance, a computationally efficient proxy of conditional mutual information. Through this analysis, we observe that the information content of the hidden representations follows a double descent as a function of the number of network parameters, akin to the behavior of the test error. We further demonstrate that in the underparameterized regime, representations learned with noisy labels are more informative than those learned with clean labels, while in the overparameterized regime, these representations are equally informative. Our results indicate that the representations of overparameterized networks are robust to label noise. We also find that the information imbalance between the penultimate and pre-softmax layers decreases with cross-entropy loss in the overparameterized regime. This offers a new perspective on understanding generalization in classification tasks. Extending our analysis to representations learned from random labels, we show that these perform worse than random features. This indicates that training on random labels drives networks well beyond lazy learning, as weights adapt to encode label information.
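The abstract does not define the Information Imbalance, so the following is only a minimal sketch under an assumed rank-based definition: for each point, find its nearest neighbour in representation A, look up that neighbour's distance rank in representation B, and report 2/N times the average rank. Under this assumption, values near 0 suggest A predicts B well and values near 1 suggest it does not. The function names, the Euclidean metric, and the toy data are illustrative choices, not taken from the paper.

```python
import math


def neighbour_ranks(points):
    """ranks[i][j] = distance rank of point j as seen from point i.

    Rank 1 is the nearest neighbour; a point's rank for itself stays 0.
    Euclidean distance is an assumption made for this sketch.
    """
    n = len(points)
    ranks = [[0] * n for _ in range(n)]
    for i in range(n):
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )
        for rank, j in enumerate(others, start=1):
            ranks[i][j] = rank
    return ranks


def information_imbalance(space_a, space_b):
    """Assumed definition: (2 / N) * mean rank in B of each point's nearest neighbour in A."""
    n = len(space_a)
    if n != len(space_b) or n < 2:
        raise ValueError("both spaces must describe the same N >= 2 points")
    ranks_a = neighbour_ranks(space_a)
    ranks_b = neighbour_ranks(space_b)
    total = 0
    for i in range(n):
        nearest_in_a = ranks_a[i].index(1)  # index of i's nearest neighbour in A
        total += ranks_b[i][nearest_in_a]
    return 2.0 * total / (n * n)


# Toy data (hypothetical): four points described by two different representations.
rep_a = [(0.0,), (1.0,), (3.0,), (6.0,)]
rep_b = [(5.0,), (0.0,), (9.0,), (2.0,)]
same = information_imbalance(rep_a, rep_a)
different = information_imbalance(rep_a, rep_b)
```

When both arguments are the same representation, every nearest neighbour has rank 1, so the result is 2/N, the smallest value this estimator can take. The pairwise ranking costs O(N² log N) here; a practical implementation would use a vectorized or tree-based neighbour search.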
Reviews: Convergence of Adversarial Training in Overparametrized Neural Networks
EDIT: I have read the author feedback and the authors have agreed to revise the writing. This is clearly a good paper that should be accepted. Two more comments regarding the rebuttal: (1) My original comments apply to natural training as well, and I understand this is a very challenging topic. For example, one such parameter is width. As far as I know, these are also the first such results for adversarial training.
A General Framework of the Consistency for Large Neural Networks
Neural networks have shown remarkable success, especially in overparameterized or "large" models. Despite increasing empirical evidence and intuitive understanding, a formal mathematical justification for the behavior of such models, particularly regarding overfitting, remains incomplete. In this paper, we propose a general regularization framework to study the Mean Integrated Squared Error (MISE) of neural networks. This framework includes many commonly used neural networks and penalties, such as ReLU and Sigmoid activations and $L^1$, $L^2$ penalties. Based on our framework, we find that the MISE curve has two possible shapes: double descent and monotone decreasing. The latter phenomenon is new in the literature, and the causes of both phenomena are also studied in theory. These studies challenge conventional statistical modeling frameworks and broaden recent findings on the double descent phenomenon in neural networks.
How Does Overparameterization Affect Features?
Duzgun, Ahmet Cagri, Jelassi, Samy, Li, Yuanzhi
Overparameterization, the condition where models have more parameters than necessary to fit their training loss, is a crucial factor for the success of deep learning. However, the characteristics of the features learned by overparameterized networks are not well understood. In this work, we explore this question by comparing models with the same architecture but different widths. We first examine the expressivity of the features of these models, and show that the feature space of overparameterized networks cannot be spanned by concatenating many underparameterized features, and vice versa. This reveals that both overparameterized and underparameterized networks acquire some distinctive features. We then evaluate the performance of these models, and find that overparameterized networks outperform underparameterized networks, even when many of the latter are concatenated. We corroborate these findings using VGG-16 and ResNet18 on CIFAR-10 and a Transformer on the MNLI classification dataset. Finally, we propose a toy setting to explain how overparameterized networks can learn some important features that the underparameterized networks cannot learn. Overparameterized neural networks, which have more parameters than necessary to fit the training data, have achieved remarkable success in various tasks, such as image classification (He et al., 2016; Krizhevsky et al., 2017), object detection (Girshick et al., 2014; Redmon et al., 2016) or text classification (Zhang et al., 2015; Johnson & Zhang, 2016). However, the theoretical understanding of why these networks outperform underparameterized ones, which have fewer parameters and less capacity, is still limited.
Random Search as a Baseline for Sparse Neural Network Architecture Search
Overparameterized neural networks are loosely characterized as networks that have a very high fitting (or memorization) capacity with respect to their training data. Although capable of memorizing their training data, these networks intriguingly achieve very low test error, close to their training error rates [1, 2]. Meanwhile, sparse neural networks have shown similar or better generalization performance than their dense counterparts while having higher parameter efficiency [3]. With the increasing availability of hardware and software that support sparse computational operations [4, 5], there has been growing interest in finding sparse sub-networks within large overparameterized models, either to improve generalization performance or to gain computational efficiency at the same performance level [6, 7, 8, 3]. Earlier works on creating efficient sparse sub-networks include the now popular pruning technique [9]. These were motivated by the desire to achieve compute efficiency in resource-constrained applications by finding smaller networks within a larger network space without losing task performance quality [10]. The original pruning technique involves fully training a larger network on some task, discarding the task-irrelevant connections, and then fine-tuning the remaining sparse sub-network on the task to achieve a level of performance near that of the larger network. Connections were originally pruned based on loss Hessians [9, 11]. Later on, other techniques were proposed, such as the removal of weak connections [12] based on weight value thresholds.
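The threshold-based removal of weak connections mentioned above can be sketched in a few lines. This is an illustrative example only, not the procedure of any cited work: the function names, the nested-list weight layout, and the threshold value are hypothetical, and the training and fine-tuning steps that surround pruning are omitted.

```python
def magnitude_prune(weights, threshold):
    """Zero every connection whose absolute weight is below `threshold`.

    `weights` is a list of rows (one row per output unit); this layout is an
    assumption for the sketch. Returns the pruned weights and a 0/1 mask.
    """
    mask = [[1 if abs(w) >= threshold else 0 for w in row] for row in weights]
    pruned = [
        [w if keep else 0.0 for w, keep in zip(row, mask_row)]
        for row, mask_row in zip(weights, mask)
    ]
    return pruned, mask


def sparsity(mask):
    """Fraction of connections removed by the mask."""
    total = sum(len(row) for row in mask)
    removed = sum(row.count(0) for row in mask)
    return removed / total


# Hypothetical weights of a small dense layer after training.
dense = [[0.8, -0.05, 0.3], [0.01, -0.6, 0.02]]
pruned, mask = magnitude_prune(dense, threshold=0.1)
removed_fraction = sparsity(mask)
```

In a full pipeline the mask would typically be kept fixed during fine-tuning so that pruned connections stay at zero while the surviving weights are updated.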