Goto

Collaborating Authors

 ridge estimator




Generalized equivalences between subsampling and ridge regularization

Neural Information Processing Systems

We establish precise structural and risk equivalences between subsampling and ridge regularization for ensemble ridge estimators. Specifically, we prove that linear and quadratic functionals of subsample ridge estimators, when fitted with different ridge regularization levels $\lambda$ and subsample aspect ratios $\psi$, are asymptotically equivalent along specific paths in the $(\lambda,\psi)$-plane (where $\psi$ is the ratio of the feature dimension to the subsample size). Our results only require bounded moment assumptions on feature and response distributions and allow for arbitrary joint distributions. Furthermore, we provide a data-dependent method to determine the equivalent paths of $(\lambda,\psi)$. An indirect implication of our equivalences is that optimally tuned ridge regression exhibits a monotonic prediction risk in the data aspect ratio. This resolves a recent open problem raised by Nakkiran et al. for general data distributions under proportional asymptotics, assuming a mild regularity condition that maintains regression hardness through linearized signal-to-noise ratios.


Implicit Regularization Paths of Weighted Neural Representations

Neural Information Processing Systems

We study the implicit regularization effects induced by (observation) weighting of pretrained features.For weight and feature matrices of bounded operator norms that are infinitesimally free with respect to (normalized) trace functionals, we derive equivalence paths connecting different weighting matrices and ridge regularization levels.Specifically, we show that ridge estimators trained on weighted features along the same path are asymptotically equivalent when evaluated against test vectors of bounded norms.These paths can be interpreted as matching the effective degrees of freedom of ridge estimators fitted with weighted features.For the special case of subsampling without replacement, our results apply to independently sampled random features and kernel features and confirm recent conjectures (Conjectures 7 and 8) of the authors on the existence of such paths in Patil and Du (2023).We also present an additive risk decomposition for ensembles of weighted estimators and show that the risks are equivalent along the paths when the ensemble size goes to infinity.As a practical consequence of the path equivalences, we develop an efficient cross-validation method for tuning and apply it to subsampled pretrained representations across several models (e.g., ResNet-50) and datasets (e.g., CIFAR-100).


Implicit Regularization Paths of Weighted Neural Representations

Neural Information Processing Systems

We study the implicit regularization effects induced by (observation) weighting of pretrained features.For weight and feature matrices of bounded operator norms that are infinitesimally free with respect to (normalized) trace functionals, we derive equivalence paths connecting different weighting matrices and ridge regularization levels.Specifically, we show that ridge estimators trained on weighted features along the same path are asymptotically equivalent when evaluated against test vectors of bounded norms.These paths can be interpreted as matching the effective degrees of freedom of ridge estimators fitted with weighted features.For the special case of subsampling without replacement, our results apply to independently sampled random features and kernel features and confirm recent conjectures (Conjectures 7 and 8) of the authors on the existence of such paths in Patil and Du (2023).We also present an additive risk decomposition for ensembles of weighted estimators and show that the risks are equivalent along the paths when the ensemble size goes to infinity.As a practical consequence of the path equivalences, we develop an efficient cross-validation method for tuning and apply it to subsampled pretrained representations across several models (e.g., ResNet-50) and datasets (e.g., CIFAR-100).


Transformer learns the cross-task prior and regularization for in-context learning

arXiv.org Machine Learning

Transformers have shown a remarkable ability for in-context learning (ICL), making predictions based on contextual examples. However, while theoretical analyses have explored this prediction capability, the nature of the inferred context and its utility for downstream predictions remain open questions. This paper aims to address these questions by examining ICL for inverse linear regression (ILR), where context inference can be characterized by unsupervised learning of underlying weight vectors. Focusing on the challenging scenario of rank-deficient inverse problems, where context length is smaller than the number of unknowns in the weight vectors and regularization is necessary, we introduce a linear transformer to learn the inverse mapping from contextual examples to the underlying weight vector. Our findings reveal that the transformer implicitly learns both a prior distribution and an effective regularization strategy, outperforming traditional ridge regression and regularization methods. A key insight is the necessity of low task dimensionality relative to the context length for successful learning. Furthermore, we numerically verify that the error of the transformer estimator scales linearly with the noise level, the ratio of task dimension to context length, and the condition number of the input data. These results not only demonstrate the potential of transformers for solving ill-posed inverse problems, but also provide a new perspective towards understanding the knowledge extraction mechanism within transformers.


Generalized equivalences between subsampling and ridge regularization

Neural Information Processing Systems

We establish precise structural and risk equivalences between subsampling and ridge regularization for ensemble ridge estimators. Specifically, we prove that linear and quadratic functionals of subsample ridge estimators, when fitted with different ridge regularization levels \lambda and subsample aspect ratios \psi, are asymptotically equivalent along specific paths in the (\lambda,\psi) -plane (where \psi is the ratio of the feature dimension to the subsample size). Our results only require bounded moment assumptions on feature and response distributions and allow for arbitrary joint distributions. Furthermore, we provide a data-dependent method to determine the equivalent paths of (\lambda,\psi) . An indirect implication of our equivalences is that optimally tuned ridge regression exhibits a monotonic prediction risk in the data aspect ratio. This resolves a recent open problem raised by Nakkiran et al. for general data distributions under proportional asymptotics, assuming a mild regularity condition that maintains regression hardness through linearized signal-to-noise ratios.


Lecture Notes on High Dimensional Linear Regression

arXiv.org Machine Learning

These lecture notes were developed for a Master's course in advanced machine learning at Erasmus University of Rotterdam. The course is designed for graduate students in mathematics, statistics and econometrics. The content follows a proposition-proof structure, making it suitable for students seeking a formal and rigorous understanding of the statistical theory underlying machine learning methods. At present, the notes focus on linear regression, with an in-depth exploration of the existence, uniqueness, relations, computation, and nonasymptotic properties of the most prominent estimators in this setting: least squares, ridgeless, ridge, and lasso. Background It is assumed that readers have a solid background in calculus, linear algebra, convex analysis, and probability theory.


Implicit Regularization Paths of Weighted Neural Representations

arXiv.org Machine Learning

In recent years, neural networks have become state-of-the-art models for tasks in computer vision and natural language processing by learning rich representations from large datasets. Pretrained neural networks, such as ResNet, which are trained on massive datasets like ImageNet, serve as valuable resources for new, smaller datasets [32]. These pretrained models reduce computational burden and generalize well in tasks such as image classification and object detection due to their rich feature space [32, 69]. Furthermore, pretrained features or neural embeddings, such as the neural tangent kernel, extracted from these models, serve as valuable representations of diverse data [33, 66]. However, despite their usefulness, fitting models based on pretrained features on large datasets can be challenging due to computational and memory constraints. When dealing with highdimensional pretrained features and large sample sizes, direct application of even simple linear regression may be computationally infeasible or memory-prohibitive [23, 44]. To address this issue, subsampling has emerged as a practical solution that reduces the dataset size, thereby alleviating the computational and memory burden. Subsampling involves creating smaller datasets by randomly selecting a subset of the original data points.


Precise analysis of ridge interpolators under heavy correlations -- a Random Duality Theory view

arXiv.org Machine Learning

While non-monotonic behavior in parametric characterizations of various random structures has been known for a long time, it has received an enormous amount of attention in recent years. By a no surprise, a larger than ever popularity of machine learning (ML) and neural networks (NN) substantially contributed to this. Two lines of work, the neural network one (see, e.g., [5, 7, 8, 41, 60, 61]) and the statistical one (see, e.g., [6,16,17,24]), together with their interconnections, are among those that, in our view, led the way in bringing such a strong interest to these phenomena. The first line, (in a way (re)initiated in [5, 7, 8, 41, 60, 61] (and further theoretically substantiated in e.g., [6, 36])), relates to the empirical observation of the so-called double-descent phenomenon in neural networks generalization abilities. In particular, it was emphasized in [5] that the generalization error has a U-shape dependence on the size of the network before and a (potentially surprising) second descent after the interpolating limit (see, also [1, 36, 37, 60, 61] and particularly [56] for early double-decent displays).