
Collaborating Authors

 Belius, David


A Comprehensive Analysis on the Learning Curve in Kernel Ridge Regression

arXiv.org Artificial Intelligence

This paper conducts a comprehensive study of the learning curves of kernel ridge regression (KRR) under minimal assumptions. Our contributions are three-fold: 1) we analyze the role of key properties of the kernel, such as its spectral eigen-decay, the characteristics of the eigenfunctions, and the smoothness of the kernel; 2) we demonstrate the validity of the Gaussian Equivalent Property (GEP), which states that the generalization performance of KRR remains the same when the whitened features are replaced by standard Gaussian vectors, thereby shedding light on the success of previous analyses under the Gaussian Design Assumption; 3) we derive novel bounds that improve over existing bounds across a broad range of settings, such as (in)dependent feature vectors and various combinations of eigen-decay rates in the over/underparameterized regimes.
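
For readers who want to see the Gaussian Equivalent Property idea in action, the following minimal sketch (our own illustration, not the paper's construction) fits KRR once on whitened non-Gaussian random features and once on standard Gaussian vectors of the same dimensions with the same planted target, and compares the two test errors; the feature map, target, and ridge level are all assumed for illustration.

```python
# Minimal numerical sketch of the Gaussian Equivalent Property (GEP) idea: compare the
# KRR test error obtained with whitened (non-Gaussian) features against the error
# obtained when those features are replaced by i.i.d. standard Gaussian vectors.
# Every modelling choice below (random-feature map, planted target, ridge level)
# is an illustrative assumption, not the construction used in the paper.
import numpy as np

rng = np.random.default_rng(0)
n, n_test, d, p, ridge, noise = 300, 300, 20, 200, 1e-2, 0.1

def features(X, W):
    return np.cos(X @ W) / np.sqrt(W.shape[1])      # assumed non-Gaussian feature map

def krr_test_error(Ztr, Zte, ytr, yte):
    K = Ztr @ Ztr.T                                  # linear kernel on the features
    alpha = np.linalg.solve(K + ridge * np.eye(len(ytr)), ytr)
    return np.mean((Zte @ Ztr.T @ alpha - yte) ** 2)

W = rng.standard_normal((d, p))
beta = rng.standard_normal(p) / np.sqrt(p)           # planted target in feature space

# Whitening statistics estimated on a large held-out sample.
Phi_big = features(rng.standard_normal((20000, d)), W)
mean = Phi_big.mean(axis=0)
evals, evecs = np.linalg.eigh(np.cov(Phi_big, rowvar=False) + 1e-6 * np.eye(p))
cov_isqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T

def whitened(X):
    return (features(X, W) - mean) @ cov_isqrt       # ~ zero mean, identity covariance

Xtr, Xte = rng.standard_normal((n, d)), rng.standard_normal((n_test, d))
Ztr, Zte = whitened(Xtr), whitened(Xte)
ytr, yte = Ztr @ beta + noise * rng.standard_normal(n), Zte @ beta

# Gaussian surrogate design: same sizes and target, but i.i.d. N(0, 1) features.
Gtr, Gte = rng.standard_normal((n, p)), rng.standard_normal((n_test, p))
gtr, gte = Gtr @ beta + noise * rng.standard_normal(n), Gte @ beta

print("test error, whitened features :", krr_test_error(Ztr, Zte, ytr, yte))
print("test error, Gaussian surrogate:", krr_test_error(Gtr, Gte, gtr, gte))
```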


Characterizing Overfitting in Kernel Ridgeless Regression Through the Eigenspectrum

arXiv.org Artificial Intelligence

Kernel regression plays a pivotal role in machine learning, since it offers an expressive and rapidly trainable framework for modeling complex relationships in data. In recent years, kernels have regained significance in deep learning theory, since many deep neural networks (DNNs) can be understood as converging to certain kernel limits. The significance of kernel regression has been underscored by its ability to approximate DNN training under certain conditions, providing a tractable avenue for analytical exploration of the test error and robust theoretical guarantees Jacot et al. [2018], Arora et al. [2019], Bordelon et al. [2020]. The adaptability of kernel regression positions it as a crucial tool in various machine learning applications, making it imperative to comprehensively understand its behavior, particularly concerning overfitting. Despite the increasing attention directed towards kernel ridge regression, the existing literature predominantly concentrates on overfitting phenomena in either the high input dimensional regime or the asymptotic regime Liang and Rakhlin [2020], Mei and Montanari [2022], Misiakiewicz [2022], also known as the ultra-high dimensional regime Zou and Zhang [2009], Fan et al. [2009]. Notably, the focus on asymptotic bounds, which require the input dimension to approach infinity, may not align with the finite nature of real-world datasets and target functions. Similarly, classical Rademacher-based bounds, e.g. Bartlett and Mendelson [2002], require that the weights of the kernel regressor satisfy data-independent a priori bounds, a restriction that is not enforced in standard kernel ridge regression algorithms. These mismatches between idealized mathematical assumptions and practical implementations necessitate a more nuanced exploration of overfitting in kernel regression at fixed input dimension.
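
As a concrete reference point for "kernel ridgeless regression" and "the eigenspectrum", the following sketch (an illustration under assumed choices, not the paper's setup) computes the minimum-norm interpolant for an RBF kernel and prints the eigenvalue decay of the kernel matrix along with the train and test errors.

```python
# Minimal sketch (not the paper's analysis): ridgeless kernel regression, i.e. the
# minimum-norm interpolant, alongside the eigenspectrum of the kernel matrix.
# The Gaussian (RBF) kernel, bandwidth, target and noise level are all illustrative
# assumptions.
import numpy as np

rng = np.random.default_rng(1)
n, n_test, d, noise = 200, 500, 5, 0.1

def rbf_kernel(A, B, bandwidth=1.0):
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2 * bandwidth ** 2))

def target(X):
    return np.sin(X.sum(axis=1))

Xtr, Xte = rng.standard_normal((n, d)), rng.standard_normal((n_test, d))
ytr = target(Xtr) + noise * rng.standard_normal(n)

K = rbf_kernel(Xtr, Xtr)
eigvals = np.linalg.eigvalsh(K)[::-1]               # eigenspectrum, largest first
print("top 5 eigenvalues:", np.round(eigvals[:5], 4))
print("decay ratio lambda_50 / lambda_1:", eigvals[49] / eigvals[0])

# Ridgeless fit: the pseudo-inverse yields the minimum-norm interpolating coefficients.
alpha = np.linalg.pinv(K) @ ytr
print("train MSE (interpolation):", np.mean((K @ alpha - ytr) ** 2))
print("test MSE:", np.mean((rbf_kernel(Xte, Xtr) @ alpha - target(Xte)) ** 2))
```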


A Theoretical Analysis of the Test Error of Finite-Rank Kernel Ridge Regression

arXiv.org Machine Learning

Generalization is a central theme in statistical learning theory. The recent renewed interest in kernel methods, especially in Kernel Ridge Regression (KRR), is largely due to the fact that deep neural network (DNN) training can be approximated using kernels under appropriate conditions Jacot et al. [2018], Arora et al. [2019], Bordelon et al. [2020], in which case the test error is more tractable analytically and thus enjoys stronger theoretical guarantees. However, many prior results have been derived under conditions incompatible with practical settings. For instance, Liang and Rakhlin [2020], Liu et al. [2021a], Mei et al. [2021], Misiakiewicz [2022] give asymptotic bounds on the KRR test error, which require the input dimension d to tend to infinity. In reality, the input dimension of the dataset and the target function is typically finite.
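
To make "finite-rank kernel ridge regression" concrete, the sketch below (illustrative assumptions throughout, not the paper's setting) runs KRR with a kernel induced by an explicit finite feature map, so the kernel matrix has rank bounded by the feature dimension regardless of the sample size.

```python
# Minimal sketch (assumptions throughout): kernel ridge regression with a finite-rank
# kernel, i.e. k(x, x') = phi(x) . phi(x') for a finite-dimensional feature map phi.
# Here phi is a small polynomial feature map; the target and noise are illustrative.
import numpy as np

rng = np.random.default_rng(2)
n, n_test, d, ridge, noise = 150, 1000, 3, 1e-3, 0.1

def phi(X):
    # Finite feature map: constant, linear and squared coordinates (rank at most 1 + 2d).
    return np.hstack([np.ones((len(X), 1)), X, X ** 2])

def finite_rank_kernel(A, B):
    return phi(A) @ phi(B).T             # kernel matrix rank is at most phi's dimension

def target(X):
    return X[:, 0] - 0.5 * X[:, 1] ** 2  # lies in the span of the feature map

Xtr, Xte = rng.standard_normal((n, d)), rng.standard_normal((n_test, d))
ytr = target(Xtr) + noise * rng.standard_normal(n)

K = finite_rank_kernel(Xtr, Xtr)
print("rank of K:", np.linalg.matrix_rank(K), "out of n =", n)

alpha = np.linalg.solve(K + ridge * np.eye(n), ytr)
pred = finite_rank_kernel(Xte, Xtr) @ alpha
print("test MSE:", np.mean((pred - target(Xte)) ** 2))
```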


Injectivity of ReLU networks: perspectives from statistical physics

arXiv.org Artificial Intelligence

When can the input of a ReLU neural network be inferred from its output? In other words, when is the network injective? We consider a single layer, $x \mapsto \mathrm{ReLU}(Wx)$, with a random Gaussian $m \times n$ matrix $W$, in a high-dimensional setting where $n, m \to \infty$. Recent work connects this problem to spherical integral geometry, giving rise to a conjectured sharp injectivity threshold for $\alpha = \frac{m}{n}$ by studying the expected Euler characteristic of a certain random set. We adopt a different perspective and show that injectivity is equivalent to a property of the ground state of the spherical perceptron, an important spin glass model in statistical physics. By leveraging the (non-rigorous) replica symmetry-breaking theory, we derive analytical equations for the threshold whose solution is at odds with that from the Euler characteristic. Furthermore, we use Gordon's min-max theorem to prove that a replica-symmetric upper bound refutes the Euler characteristic prediction. Along the way we aim to give a tutorial-style introduction to key ideas from statistical physics in an effort to make the exposition accessible to a broad audience. Our analysis establishes a connection between spin glasses and integral geometry but leaves open the problem of explaining the discrepancies.
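
The paper's replica and Gordon min-max arguments are not reproduced here, but the elementary geometric mechanism behind non-injectivity can be demonstrated numerically. The sketch below (our own illustration, with $m = n$ chosen so that a witness is easy to construct) builds a direction whose active rows fail to span $\mathbb{R}^n$ and verifies that a small orthogonal perturbation leaves $\mathrm{ReLU}(Wx)$ unchanged.

```python
# Illustrative sketch only (not the paper's replica or Gordon min-max analysis): the
# elementary geometric mechanism behind non-injectivity of x -> ReLU(Wx).  If for some
# x the "active" rows {w_i : <w_i, x> > 0} do not span R^n, then perturbing x inside
# their orthogonal complement, gently enough that the strictly negative rows stay
# negative, leaves ReLU(Wx) unchanged.  Below such an x is built explicitly for m = n,
# where span-deficiency is easy to arrange; the paper asks at which aspect ratio
# alpha = m / n such directions cease to exist for Gaussian W.
import numpy as np

rng = np.random.default_rng(4)
n = m = 10
W = rng.standard_normal((m, n))

# Choose x0 so that row 1 is strictly inactive and rows 2..m are strictly active.
y = np.ones(m)
y[0] = -1.0
x0 = np.linalg.solve(W, y)

active = W[W @ x0 > 0]                # m - 1 = n - 1 rows: they cannot span R^n
v = np.linalg.svd(active)[2][-1]      # unit vector orthogonal to every active row

eps = 1e-3                            # small enough to keep the inactive row negative
x1 = x0 + eps * v
print("inputs differ:     ", not np.allclose(x0, x1))
print("ReLU outputs agree:",
      np.allclose(np.maximum(W @ x0, 0.0), np.maximum(W @ x1, 0.0)))
```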


On the Empirical Neural Tangent Kernel of Standard Finite-Width Convolutional Neural Network Architectures

arXiv.org Machine Learning

The Neural Tangent Kernel (NTK) is an important milestone in the ongoing effort to build a theory of deep learning. Its prediction that sufficiently wide neural networks behave as kernel methods, or equivalently as random feature models, has been confirmed empirically for certain wide architectures. It remains an open question how well NTK theory models standard neural network architectures of widths common in practice, trained on complex datasets such as ImageNet. We study this question empirically for two well-known convolutional neural network architectures, namely AlexNet and LeNet, and find that their behavior deviates significantly from that of their finite-width NTK counterparts. For wider versions of these networks, where the number of channels and the widths of the fully-connected layers are increased, the deviation decreases.
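
For concreteness, the empirical (finite-width) NTK can be computed directly from parameter gradients. The sketch below does this in closed form for a toy one-hidden-layer ReLU network; it is only meant to make the object studied in the paper concrete and does not reproduce the AlexNet/LeNet experiments.

```python
# Minimal sketch of the empirical (finite-width) NTK for a toy one-hidden-layer ReLU
# network f(x) = a^T relu(W x) / sqrt(width): the NTK between two inputs is the inner
# product of the parameter gradients of the network output.  This toy model is an
# assumption for illustration; the paper studies AlexNet/LeNet-style CNNs.
import numpy as np

rng = np.random.default_rng(5)
d, width, n = 8, 512, 6

W = rng.standard_normal((width, d))            # hidden-layer weights
a = rng.standard_normal(width)                 # output weights

def grads(x):
    """Gradient of f(x) with respect to all parameters, flattened."""
    pre = W @ x                                 # pre-activations
    act = np.maximum(pre, 0.0)                  # relu
    df_da = act / np.sqrt(width)                                    # d f / d a_j
    df_dW = (a * (pre > 0))[:, None] * x[None, :] / np.sqrt(width)  # d f / d W_{j,:}
    return np.concatenate([df_da, df_dW.ravel()])

X = rng.standard_normal((n, d))
G = np.stack([grads(x) for x in X])             # n x (#params) Jacobian of outputs
ntk = G @ G.T                                   # empirical NTK Gram matrix, n x n

print("empirical NTK Gram matrix:\n", np.round(ntk, 3))
```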