Plotting

 Belkin, Mikhail


On the Inconsistency of Kernel Ridgeless Regression in Fixed Dimensions

arXiv.org Artificial Intelligence

``Benign overfitting'', the ability of certain algorithms to interpolate noisy training data and yet perform well out-of-sample, has been a topic of considerable recent interest. We show, using a fixed design setup, that an important class of predictors, kernel machines with translation-invariant kernels, does not exhibit benign overfitting in fixed dimensions. In particular, the estimated predictor does not converge to the ground truth with increasing sample size, for any non-zero regression function and any (even adaptive) bandwidth selection. To prove these results, we give exact expressions for the generalization error, and its decomposition in terms of an approximation error and an estimation error that elicits a trade-off based on the selection of the kernel bandwidth. Our results apply to commonly used translation-invariant kernels such as Gaussian, Laplace, and Cauchy.


Cut your Losses with Squentropy

arXiv.org Artificial Intelligence

Nearly all practical neural models for classification are trained using cross-entropy loss. Yet this ubiquitous choice is supported by little theoretical or empirical evidence. Recent work (Hui & Belkin, 2020) suggests that training using the (rescaled) square loss is often superior in terms of the classification accuracy. In this paper we propose the "squentropy" loss, which is the sum of two terms: the cross-entropy loss and the average square loss over the incorrect classes. We provide an extensive set of experiments on multi-class classification problems showing that the squentropy loss outperforms both the pure cross entropy and rescaled square losses in terms of the classification accuracy. We also demonstrate that it provides significantly better model calibration than either of these alternative losses and, furthermore, has less variance with respect to the random initialization. Additionally, in contrast to the square loss, squentropy loss can typically be trained using exactly the same optimization parameters, including the learning rate, as the standard cross-entropy loss, making it a true "plug-and-play" replacement. Finally, unlike the rescaled square loss, multiclass squentropy contains no parameters that need to be adjusted.


Restricted Strong Convexity of Deep Learning Models with Smooth Activations

arXiv.org Artificial Intelligence

We consider the problem of optimization of deep learning models with smooth activation functions. While there exist influential results on the problem from the ``near initialization'' perspective, we shed considerable new light on the problem. In particular, we make two key technical contributions for such models with $L$ layers, $m$ width, and $\sigma_0^2$ initialization variance. First, for suitable $\sigma_0^2$, we establish a $O(\frac{\text{poly}(L)}{\sqrt{m}})$ upper bound on the spectral norm of the Hessian of such models, considerably sharpening prior results. Second, we introduce a new analysis of optimization based on Restricted Strong Convexity (RSC) which holds as long as the squared norm of the average gradient of predictors is $\Omega(\frac{\text{poly}(L)}{\sqrt{m}})$ for the square loss. We also present results for more general losses. The RSC based analysis does not need the ``near initialization" perspective and guarantees geometric convergence for gradient descent (GD). To the best of our knowledge, ours is the first result on establishing geometric convergence of GD based on RSC for deep learning models, thus becoming an alternative sufficient condition for convergence that does not depend on the widely-used Neural Tangent Kernel (NTK). We share preliminary experimental results supporting our theoretical advances.


A note on Linear Bottleneck networks and their Transition to Multilinearity

arXiv.org Machine Learning

For a wide neural network (WNN), when the network width is sufficiently large, there exists a linear function of parameters, arbitrarily close to the network function, in a ball of radius O(1) in the parameter space around random initialization. This local linearity explains the equivalence to the neural tangent kernel (NTK) regression for optimizing wide neural networks with small learning rates, first shown in [13]. However, an important assumption for this transition to linearity [18] to hold is that each layer must be sufficiently wide. If there is even one narrow "bottleneck" hidden layer, resulting in a so-called bottleneck neural network (BNN), the work [18] showed that the transition to linearity does not occur. An immediate question at this point is, What functions of the weights does a neural network with a bottleneck layer represent?


Wide and Deep Neural Networks Achieve Optimality for Classification

arXiv.org Artificial Intelligence

While neural networks are used for classification tasks across domains, a long-standing open problem in machine learning is determining whether neural networks trained using standard procedures are optimal for classification, i.e., whether such models minimize the probability of misclassification for arbitrary data distributions. In this work, we identify and construct an explicit set of neural network classifiers that achieve optimality. Since effective neural networks in practice are typically both wide and deep, we analyze infinitely wide networks that are also infinitely deep. In particular, using the recent connection between infinitely wide neural networks and Neural Tangent Kernels, we provide explicit activation functions that can be used to construct networks that achieve optimality. Interestingly, these activation functions are simple and easy to implement, yet differ from commonly used activations such as ReLU or sigmoid. More generally, we create a taxonomy of infinitely wide and deep networks and show that these models implement one of three well-known classifiers depending on the activation function used: (1) 1-nearest neighbor (model predictions are given by the label of the nearest training example); (2) majority vote (model predictions are given by the label of the class with greatest representation in the training set); or (3) singular kernel classifiers (a set of classifiers containing those that achieve optimality). Our results highlight the benefit of using deep networks for classification tasks, in contrast to regression tasks, where excessive depth is harmful.


Limitations of Neural Collapse for Understanding Generalization in Deep Learning

arXiv.org Machine Learning

The recent work of Papyan, Han, & Donoho (2020) presented an intriguing "Neural Collapse" phenomenon, showing a structural property of interpolating classifiers in the late stage of training. This opened a rich area of exploration studying this phenomenon. Our motivation is to study the upper limits of this research program: How far will understanding Neural Collapse take us in understanding deep learning? First, we investigate its role in generalization. We refine the Neural Collapse conjecture into two separate conjectures: collapse on the train set (an optimization property) and collapse on the test distribution (a generalization property). We find that while Neural Collapse often occurs on the train set, it does not occur on the test set. We thus conclude that Neural Collapse is primarily an optimization phenomenon, with as-yet-unclear connections to generalization. Second, we investigate the role of Neural Collapse in feature learning. We show simple, realistic experiments where training longer leads to worse last-layer features, as measured by transfer-performance on a downstream task. This suggests that neural collapse is not always desirable for representation learning, as previously claimed. Finally, we give preliminary evidence of a "cascading collapse" phenomenon, wherein some form of Neural Collapse occurs not only for the last layer, but in earlier layers as well. We hope our work encourages the community to continue the rich line of Neural Collapse research, while also considering its inherent limitations.


Benign Overfitting in Two-layer Convolutional Neural Networks

arXiv.org Machine Learning

Modern neural networks often have great expressive power and can be trained to overfit the training data, while still achieving a good test performance. This phenomenon is referred to as "benign overfitting". Recently, there emerges a line of works studying "benign overfitting" from the theoretical perspective. However, they are limited to linear models or kernel/random feature models, and there is still a lack of theoretical understanding about when and how benign overfitting occurs in neural networks. In this paper, we study the benign overfitting phenomenon in training a two-layer convolutional neural network (CNN). We show that when the signal-to-noise ratio satisfies a certain condition, a two-layer CNN trained by gradient descent can achieve arbitrarily small training and test loss. On the other hand, when this condition does not hold, overfitting becomes harmful and the obtained CNN can only achieve constant level test loss. These together demonstrate a sharp phase transition between benign overfitting and harmful overfitting, driven by the signal-to-noise ratio. To the best of our knowledge, this is the first work that precisely characterizes the conditions under which benign overfitting can occur in training convolutional neural networks.


Fit without fear: remarkable mathematical phenomena of deep learning through the prism of interpolation

arXiv.org Machine Learning

In the past decade the mathematical theory of machine learning has lagged far behind the triumphs of deep neural networks on practical challenges. However, the gap between theory and practice is gradually starting to close. In this paper I will attempt to assemble some pieces of the remarkable and still incomplete mathematical mosaic emerging from the efforts to understand the foundations of deep learning. The two key themes will be interpolation, and its sibling, over-parameterization. Interpolation corresponds to fitting data, even noisy data, exactly. Over-parameterization enables interpolation and provides flexibility to select a right interpolating model. As we will see, just as a physical prism separates colors mixed within a ray of light, the figurative prism of interpolation helps to disentangle generalization and optimization properties within the complex picture of modern Machine Learning. This article is written with belief and hope that clearer understanding of these issues brings us a step closer toward a general theory of deep learning and machine learning.


Risk Bounds for Over-parameterized Maximum Margin Classification on Sub-Gaussian Mixtures

arXiv.org Machine Learning

In modern machine learning, complex models such as deep neural networks have received increasing popularity. These complicated models are known to be able to fit noisy training data sets, while at the same time achieving small test errors. In fact, this benign overfitting phenomenon is not a unique feature of deep learning. Even for kernel methods and linear models, Belkin et al. (2018) demonstrated that interpolators on the noisy training data can still perform near optimally on the test data. A series of recent works (Belkin et al., 2019b; Muthukumar et al., 2020b; Hastie et al., 2019; Bartlett et al., 2020) theoretically studied how over-parameterization can achieve small population risk. In particular in Bartlett et al. (2020) the authors considered the setting where the data are generated from a ground-truth linear model with noise, and established a tight population risk bound for the minimum norm linear interpolator with a matching lower bound. More recently, Tsigler and Bartlett (2020) further studied benign overfitting in ridge regression, and established non-asymptotic generalization bounds for over-parametrized ridge regression. They showed that those bounds are tight for a range of regularization parameter values. Notably, these results cover arbitrary covariance structure of the data, and give a nice characterization of how the spectrum of the data covariance matrix affects the population risk in the over-parameterized regime.


Evaluation of Neural Architectures Trained with Square Loss vs Cross-Entropy in Classification Tasks

arXiv.org Machine Learning

Modern neural architectures for classification tasks are trained using the crossentropy loss, which is widely believed to be empirically superior to the square loss. In this work we provide evidence indicating that this belief may not be wellfounded. We explore several major neural architectures and a range of standard benchmark datasets for NLP, automatic speech recognition (ASR) and computer vision tasks to show that these architectures, with the same hyper-parameter settings as reported in the literature, perform comparably or better when trained with the square loss, even after equalizing computational resources. Indeed, we observe that the square loss produces better results in the dominant majority of NLP and ASR experiments. Cross-entropy appears to have a slight edge on computer vision tasks. We argue that there is little compelling empirical or theoretical evidence indicating a clear-cut advantage to the cross-entropy loss. Indeed, in our experiments, performance on nearly all non-vision tasks can be improved, sometimes significantly, by switching to the square loss. Furthermore, training with square loss appears to be less sensitive to the randomness in initialization. We posit that training using the square loss for classification needs to be a part of best practices of modern deep learning on equal footing with cross-entropy. Modern deep neural networks are nearly universally trained with cross-entropy loss in classification tasks.