Belkin, Mikhail
Multiple Descent: Design Your Own Generalization Curve
Chen, Lin, Min, Yifei, Belkin, Mikhail, Karbasi, Amin
This paper explores the generalization loss of linear regression in variably parameterized families of models, both under-parameterized and over-parameterized. We show that the generalization curve can have an arbitrary number of peaks, and moreover, the locations of those peaks can be explicitly controlled. Our results highlight the fact that both the classical U-shaped generalization curve and the recently observed double descent curve are not intrinsic properties of the model family. Instead, their emergence is due to the interaction between the properties of the data and the inductive biases of learning algorithms.
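As a hedged illustration of the kind of curve discussed above (a minimal NumPy sketch, not the construction from the paper): the test risk of minimum-norm least squares over a growing set of features typically shows a pronounced peak near the interpolation threshold p = n, the simplest instance of a non-monotone generalization curve.

```python
# Illustrative sketch (not the paper's construction): test risk of
# minimum-norm least squares as the number of features p varies,
# typically peaking near the interpolation threshold p = n.
import numpy as np

rng = np.random.default_rng(0)
n, d_max, n_test, noise = 40, 120, 2000, 0.5
beta = rng.normal(size=d_max) / np.sqrt(d_max)       # ground-truth coefficients

X = rng.normal(size=(n, d_max))
y = X @ beta + noise * rng.normal(size=n)
X_test = rng.normal(size=(n_test, d_max))
y_test = X_test @ beta

for p in range(5, d_max + 1, 5):
    # Fit using only the first p features; lstsq returns the minimum-norm
    # solution when the system is under-determined (p > n).
    b_hat, *_ = np.linalg.lstsq(X[:, :p], y, rcond=None)
    risk = np.mean((X_test[:, :p] @ b_hat - y_test) ** 2)
    print(f"p = {p:3d}  test risk = {risk:.3f}")
```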
On the linearity of large non-linear models: when and why the tangent kernel is constant
Liu, Chaoyue, Zhu, Libin, Belkin, Mikhail
The goal of this work is to shed light on the remarkable phenomenon of transition to linearity of certain neural networks as their width approaches infinity. We show that the transition to linearity of the model and, equivalently, the constancy of the (neural) tangent kernel (NTK) result from the scaling properties of the norm of the Hessian matrix of the network as a function of the network width. We present a general framework for understanding the constancy of the tangent kernel via Hessian scaling, applicable to the standard classes of neural networks. Our analysis provides a new perspective on the phenomenon of the constant tangent kernel, which is different from the widely accepted "lazy training". Furthermore, we show that the transition to linearity is not a general property of wide neural networks and does not hold when the last layer of the network is non-linear. It is also not necessary for successful optimization by gradient descent.
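In standard notation (a sketch consistent with the abstract, not the paper's exact statements), the quantities involved are the tangent kernel and the first-order expansion of the model around initialization; the approximation error, and hence the drift of the kernel, is controlled by the Hessian of the model with respect to the parameters:

```latex
% Notation sketch (standard definitions, not verbatim from the paper):
% for a model f(w; x) with parameters w, the (neural) tangent kernel and the
% first-order (linear) approximation around the initialization w_0 are
\[
  K_w(x, x') = \nabla_w f(w; x)^{\top} \nabla_w f(w; x'),
  \qquad
  f(w; x) \approx f(w_0; x) + \nabla_w f(w_0; x)^{\top} (w - w_0).
\]
% The error of the linear approximation is controlled by the spectral norm of
% the Hessian H(w) of f with respect to w (Taylor remainder bound):
\[
  \bigl| f(w; x) - f(w_0; x) - \nabla_w f(w_0; x)^{\top}(w - w_0) \bigr|
  \le \tfrac{1}{2}\, \|w - w_0\|^2 \sup_{v \in [w_0, w]} \|H(v)\|,
\]
% so if the Hessian norm vanishes as the width grows, the model transitions to
% linearity and the tangent kernel stays (nearly) constant during training.
```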
Linear Convergence and Implicit Regularization of Generalized Mirror Descent with Time-Dependent Mirrors
Radhakrishnan, Adityanarayanan, Belkin, Mikhail, Uhler, Caroline
The following questions are fundamental to understanding the properties of over-parameterization in modern machine learning: (1) Under what conditions and at what rate does training converge to a global minimum? (2) What form of implicit regularization occurs through training? While significant progress has been made in answering both of these questions for gradient descent, they have yet to be answered more completely for general optimization methods. In this work, we establish sufficient conditions for linear convergence and obtain approximate implicit regularization results for generalized mirror descent (GMD), a generalization of mirror descent with a possibly time-dependent mirror. GMD subsumes popular first-order optimization methods including gradient descent, mirror descent, and preconditioned gradient descent methods such as Adagrad. By using the Polyak-Łojasiewicz inequality, we first present a simple analysis under which non-stochastic GMD converges linearly to a global minimum. We then present a novel, Taylor-series based analysis to establish sufficient conditions for linear convergence of stochastic GMD. As a corollary, our result establishes sufficient conditions and provides learning rates for linear convergence of stochastic mirror descent and Adagrad. Lastly, we obtain approximate implicit regularization results for GMD by proving that GMD converges to an interpolating solution that is approximately the closest interpolating solution to the initialization in $\ell_2$-norm in the dual space, thereby generalizing the result of Azizan, Lale, and Hassibi (2019) in the full batch setting.
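For orientation, here are standard update forms that the abstract says GMD subsumes (written in common notation as an illustrative sketch; the paper's time-dependent mirror generalizes these):

```latex
% Standard updates subsumed by GMD (common notation, not necessarily the
% paper's exact formulation): classical mirror descent with potential \psi,
% and a preconditioned variant with a (possibly time-dependent) matrix G_t.
\[
  \nabla \psi(w_{t+1}) = \nabla \psi(w_t) - \eta\, \nabla f(w_t)
  \quad \text{(mirror descent; } \psi(w) = \tfrac{1}{2}\|w\|_2^2
  \text{ recovers gradient descent)},
\]
\[
  w_{t+1} = w_t - \eta\, G_t^{-1} \nabla f(w_t)
  \quad \text{(preconditioned gradient descent; Adagrad uses an adaptive } G_t\text{)}.
\]
```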
Classification vs regression in overparameterized regimes: Does the loss function matter?
Muthukumar, Vidya, Narang, Adhyyan, Subramanian, Vignesh, Belkin, Mikhail, Hsu, Daniel, Sahai, Anant
Paradigmatic problems in supervised machine learning (ML) involve predicting an output response from an input, based on patterns extracted from a (training) dataset. In classification, the output response is (finitely) discrete and we need to classify input data into one of these discrete categories. In regression, the output is continuous, typically a real number or a vector. Owing to this important distinction in output response, the two tasks are typically treated differently. The differences in treatment manifest in two phases of modern ML: optimization (training), which consists of an algorithmic procedure to extract a predictor from the training data, typically by minimizing the training loss (also called empirical risk); and generalization (testing), which consists of an evaluation of the obtained predictor on a separate test, or validation, dataset. Traditionally, the choice of loss functions for both phases is starkly different across classification and regression tasks. The squared-loss function is typically used for both the training and testing phases in regression. In contrast, the hinge or logistic (cross-entropy for multi-class problems) loss functions are typically used in the training phase of classification, while the very different 0-1 loss function is used for testing.
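A minimal code sketch of the losses being contrasted, written for a binary label y in {-1, +1} and a real-valued model output z = f(x) (an illustrative summary in standard form, not code from the paper):

```python
# Loss functions contrasted in the abstract, for binary labels y in {-1, +1}
# and a real-valued model output (margin) z = f(x).
import numpy as np

def squared_loss(z, y):      # used for both training and testing in regression
    return (z - y) ** 2

def hinge_loss(z, y):        # common training loss for classification (SVM)
    return np.maximum(0.0, 1.0 - y * z)

def logistic_loss(z, y):     # cross-entropy in the binary (+/-1) formulation
    return np.log1p(np.exp(-y * z))

def zero_one_loss(z, y):     # the loss actually used to *test* classifiers
    return (np.sign(z) != y).astype(float)

z, y = np.array([2.0, -0.3, 0.1]), np.array([1, 1, -1])
for name, fn in [("squared", squared_loss), ("hinge", hinge_loss),
                 ("logistic", logistic_loss), ("0-1", zero_one_loss)]:
    print(name, fn(z, y))
```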
Overparameterized Neural Networks Can Implement Associative Memory
Radhakrishnan, Adityanarayanan, Belkin, Mikhail, Uhler, Caroline
Identifying computational mechanisms for memorization and retrieval is a long-standing problem at the intersection of machine learning and neuroscience. In this work, we demonstrate empirically that overparameterized deep neural networks trained using standard optimization methods provide a mechanism for memorization and retrieval of real-valued data. In particular, we show that overparameterized autoencoders store training examples as attractors, and thus, can be viewed as implementations of associative memory with the retrieval mechanism given by iterating the map. We study this phenomenon under a variety of common architectures and optimization methods and construct a network that can recall 500 real-valued images without any apparent spurious attractor states. Lastly, we demonstrate how the same mechanism allows encoding sequences, including movies and audio, instead of individual examples. Interestingly, this appears to provide an even more efficient mechanism for storage and retrieval than autoencoding single instances.
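A minimal sketch of the retrieval mechanism described above: train an overparameterized autoencoder on a handful of examples, then retrieve a stored example by iterating the learned map from a corrupted input. This is illustrative only; whether the training examples actually become attractors depends on the architecture, width, optimizer, and training time (the toy model and hyperparameters below are assumptions, not the paper's setup).

```python
# Illustrative sketch: autoencoder-as-associative-memory via iterating the map.
import torch
import torch.nn as nn

torch.manual_seed(0)
d, width, n_stored = 16, 512, 3
examples = torch.rand(n_stored, d)                 # "memories" to store

model = nn.Sequential(nn.Linear(d, width), nn.ReLU(),
                      nn.Linear(width, width), nn.ReLU(),
                      nn.Linear(width, d))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for _ in range(5000):                              # fit the training set (near) exactly
    opt.zero_grad()
    loss = ((model(examples) - examples) ** 2).mean()
    loss.backward()
    opt.step()

# Retrieval: start from a corrupted example and iterate the map x <- A(x).
x = examples[0] + 0.1 * torch.randn(d)
with torch.no_grad():
    for _ in range(100):
        x = model(x)
print("distance to stored example:", (x - examples[0]).norm().item())
```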
Two models of double descent for weak features
Belkin, Mikhail, Hsu, Daniel, Xu, Ji
The "double descent" risk curve was proposed by Belkin, Hsu, Ma, and Mandal [Bel 18] to qualitatively describe the out-of-sample prediction performance of several variably-parameterized machine learning models. This risk curve reconciles the classical bias-variance tradeoff with the behavior of predictive models that interpolate training data, as observed for several model families (including neural networks) in a wide variety of applications [BO98; AS17; Spi 18; Bel 18]. In these studies, a predictive model with p parameters is fit to a training sample of size n, and the test risk (i.e., out-of-sample error) is examined as a function of p. When p is below the sample size n, the test risk is governed by the usual bias-variance decomposition. As p is increased towards n, the training risk (i.e., in-sample error) is driven to zero, but the test risk shoots up towards infinity.
Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate
Belkin, Mikhail, Hsu, Daniel J., Mitra, Partha
Many modern machine learning models are trained to achieve zero or near-zero training error in order to obtain near-optimal (but non-zero) test error. This phenomenon of strong generalization performance for ``overfitted'' / interpolated classifiers appears to be ubiquitous in high-dimensional data, having been observed in deep networks, kernel machines, boosting and random forests. Their performance is consistently robust even when the data contain large amounts of label noise. Very little theory is available to explain these observations. The vast majority of theoretical analyses of generalization allows for interpolation only when there is little or no label noise. This paper takes a step toward a theoretical foundation for interpolated classifiers by analyzing local interpolating schemes, including a geometric simplicial interpolation algorithm and singularly weighted $k$-nearest neighbor schemes. Consistency or near-consistency is proved for these schemes in classification and regression problems. Moreover, the nearest neighbor schemes exhibit optimal rates under some standard statistical assumptions. Finally, this paper suggests a way to explain the phenomenon of adversarial examples, which are seemingly ubiquitous in modern machine learning, and also discusses some connections to kernel machines and random forests in the interpolated regime.
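A minimal sketch in the spirit of the singularly weighted nearest-neighbor schemes mentioned above (the exponent and the use of all training points are illustrative choices, not the paper's exact scheme): the weights diverge at the training points, so the rule fits the data exactly while still averaging labels nearby.

```python
# Singularly weighted nearest-neighbor interpolation (illustrative sketch).
import numpy as np

def singular_nn_predict(X_train, y_train, x, delta=2.0):
    dists = np.linalg.norm(X_train - x, axis=1)
    hits = dists == 0.0
    if hits.any():                         # exactly at a training point: interpolate
        return float(y_train[hits][0])
    w = dists ** (-delta)                  # singular weights ~ 1 / dist^delta
    return float(np.dot(w, y_train) / w.sum())

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)    # noisy labels
print(singular_nn_predict(X, y, X[0]))             # returns y[0] exactly
print(singular_nn_predict(X, y, np.zeros(2)))      # smooth weighted average elsewhere
```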
Reconciling modern machine learning and the bias-variance trade-off
Belkin, Mikhail, Hsu, Daniel, Ma, Siyuan, Mandal, Soumik
The question of generalization in machine learning---how algorithms are able to learn predictors from a training sample to make accurate predictions out-of-sample---is revisited in light of the recent breakthroughs in modern machine learning technology. The classical approach to understanding generalization is based on bias-variance trade-offs, where model complexity is carefully calibrated so that the fit on the training sample reflects performance out-of-sample. However, it is now common practice to fit highly complex models like deep neural networks to data with (nearly) zero training error, and yet these interpolating predictors are observed to have good out-of-sample accuracy even for noisy data. How can the classical understanding of generalization be reconciled with these observations from modern machine learning practice? In this paper, we bridge the two regimes by exhibiting a new "double descent" risk curve that extends the traditional U-shaped bias-variance curve beyond the point of interpolation. Specifically, the curve shows that as soon as the model complexity is high enough to achieve interpolation on the training sample---a point that we call the "interpolation threshold"---the risk of suitably chosen interpolating predictors from these models can, in fact, be decreasing as the model complexity increases, often below the risk achieved using non-interpolating models. The double descent risk curve is demonstrated for a broad range of models, including neural networks and random forests, and a mechanism for producing this behavior is posited.
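As a concrete instance of a "suitably chosen interpolating predictor" (the over-parameterized linear-regression case in standard notation; the paper itself demonstrates the curve for a broader range of models, including neural networks and random forests):

```latex
% Among all coefficient vectors that fit the training data exactly, take the
% one of minimum norm (standard formulation; one concrete example of a
% "suitably chosen" interpolating predictor).
\[
  \hat\beta = \operatorname*{arg\,min}_{\beta}\ \|\beta\|_2
  \quad \text{subject to } X\beta = y,
  \qquad
  \hat\beta = X^{\top}(XX^{\top})^{-1} y
  \ \ (\text{when } XX^{\top} \text{ is invertible}).
\]
```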
MaSS: an Accelerated Stochastic Method for Over-parametrized Learning
Liu, Chaoyue, Belkin, Mikhail
Stochastic gradient-based methods are dominant in optimization for most large-scale machine learning problems, due to the simplicity of computation and their compatibility with modern parallel hardware, such as GPUs. In most cases these methods use over-parametrized models allowing for interpolation, i.e., perfect fitting of the training data. While we do not yet have a full understanding of why these solutions generalize (as indicated by a wealth of empirical evidence, e.g., [22, 2]), we are beginning to recognize their desirable properties for optimization, particularly in the SGD setting [11]. In this paper, we leverage the power of the interpolated setting to propose MaSS (Momentum-added Stochastic Solver), a stochastic momentum method for efficient training of over-parametrized models; see the pseudocode in Appendix A. The algorithm keeps two variables (weights), w and u.
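For context, here is a hedged sketch of a generic two-variable Nesterov-style stochastic momentum loop, to show the role of the two weight sequences w and u; this is not the exact MaSS update, which modifies such a scheme (see the paper and its Appendix A for the actual algorithm and step sizes).

```python
# Generic two-variable (Nesterov-style) stochastic momentum loop.
# NOT the exact MaSS update; illustrative of the w/u structure only.
import numpy as np

def sgd_nesterov(grad_fn, w0, lr=0.01, momentum=0.9, steps=300):
    w = np.asarray(w0, dtype=float)
    w_prev = w.copy()
    for _ in range(steps):
        u = w + momentum * (w - w_prev)     # look-ahead variable u
        g = grad_fn(u)                      # stochastic gradient evaluated at u
        w_prev, w = w, u - lr * g           # gradient step from u gives the new w
        # (MaSS additionally modifies how the next u is formed.)
    return w

# Toy example: minimize 0.5 * ||A x - b||^2 with noisy gradients.
rng = np.random.default_rng(0)
A, b = rng.normal(size=(20, 5)), rng.normal(size=20)
grad = lambda x: A.T @ (A @ x - b) + 0.01 * rng.normal(size=5)
print(sgd_nesterov(grad, np.zeros(5)))
```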