 inner product kernel


On the Pinsker bound of inner product kernel regression in large dimensions

Lu, Weihao, Ding, Jialin, Zhang, Haobo, Lin, Qian

arXiv.org Machine Learning

This intriguing phenomenon, where the two asymptotics are equal, was rigorously justified by the seminal works on Le Cam equivalence, which established the asymptotic equivalence between Gaussian sequence models, the white noise model, and certain nonparametric regression models (see, e.g., [3, 4, 5]). Since then, subsequent studies have established similar exact risks for a variety of nonparametric estimation problems, including density estimation, regression models with non-Gaussian noise or random designs, analysis of Besov bodies, and wavelet estimation (e.g., [6, 7, 8, 2, 9, 10, 11, 12, 13]). For a detailed review of these developments, one can refer to [14] and the references therein. Constants akin to $\beta(m, R)$, now often referred to as the Pinsker constant, play an indispensable role in studying the super-efficiency phenomenon observed in nonparametric problems, which has been the subject of extensive investigation (e.g., [15, 16, 17, 18]). Recently, the strong theoretical links between the training dynamics of wide neural networks and the corresponding neural tangent kernel have motivated substantial research into the performance of spectral algorithms, such as kernel ridge regression and kernel gradient descent, on kernel regression problems (see, e.g., [19, 20, 21, 22, 23, 24]).
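
For context, the exact-risk statements referred to above take the following generic form in the Gaussian white noise model with noise level $\varepsilon$; the notation $W(m, R)$ for a Sobolev-type ball of smoothness $m$ and radius $R$ and $\beta(m, R)$ for the limiting constant follows the abstract, and the precise normalization of the ball is an assumption here rather than a quotation from the paper:
$$\lim_{\varepsilon \to 0} \; \varepsilon^{-\frac{4m}{2m+1}} \, \inf_{\hat f} \, \sup_{f \in W(m, R)} \mathbb{E}\,\|\hat f - f\|_{L^2}^{2} \;=\; \beta(m, R),$$
i.e., the minimax risk is pinned down not only up to its rate $\varepsilon^{4m/(2m+1)}$ but up to the exact constant $\beta(m, R)$.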


The phase diagram of kernel interpolation in large dimensions

Zhang, Haobo, Lu, Weihao, Lin, Qian

arXiv.org Machine Learning

The generalization ability of kernel interpolation in large dimensions (i.e., $n \asymp d^{\gamma}$ for some $\gamma>0$) might be one of the most interesting problems in the recent renaissance of kernel regression, since it may help us understand the 'benign overfitting phenomenon' reported in the neural networks literature. Focusing on the inner product kernel on the sphere, we fully characterized the exact order of both the variance and the bias of large-dimensional kernel interpolation under various source conditions $s\geq 0$. Consequently, we obtained the $(s,\gamma)$-phase diagram of large-dimensional kernel interpolation, i.e., we determined the regions in the $(s,\gamma)$-plane where kernel interpolation is minimax optimal, sub-optimal, and inconsistent.
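
As a concrete illustration of this setup (not code from the paper), the following minimal sketch fits a ridgeless, interpolating kernel estimator with a simple inner product kernel on the sphere and measures its test error as $\gamma$ varies; the specific kernel $\exp(\langle x, z\rangle)$, the target function, the noise level, and the helper names are all assumptions made for the example.

```python
import numpy as np

def sphere_sample(n, d, rng):
    """Draw n points uniformly from the unit sphere S^{d-1}."""
    x = rng.standard_normal((n, d))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def inner_product_kernel(X, Z):
    """A simple inner product (dot-product) kernel k(x, z) = exp(<x, z>)."""
    return np.exp(X @ Z.T)

def interpolation_test_error(d, gamma, rng):
    """Fit the (near-)minimum-norm kernel interpolant and return its test MSE."""
    n = int(d ** gamma)                          # sample size n ~ d^gamma
    X, X_test = sphere_sample(n, d, rng), sphere_sample(2000, d, rng)
    f = lambda Z: Z[:, 0] * Z[:, 1]              # assumed smooth target on the sphere
    y = f(X) + 0.1 * rng.standard_normal(n)      # noisy observations
    K = inner_product_kernel(X, X)
    alpha = np.linalg.solve(K + 1e-10 * np.eye(n), y)   # tiny jitter for numerical stability
    y_hat = inner_product_kernel(X_test, X) @ alpha
    return np.mean((y_hat - f(X_test)) ** 2)

rng = np.random.default_rng(0)
for gamma in (1.0, 1.5, 2.0):
    print(f"gamma={gamma}: test MSE ~ {interpolation_test_error(20, gamma, rng):.4f}")
```

Sweeping $\gamma$ (and the smoothness of the target) in such a toy experiment is the empirical analogue of tracing the $(s,\gamma)$-phase diagram described above.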


Optimal Rates of Kernel Ridge Regression under Source Condition in Large Dimensions

Zhang, Haobo, Li, Yicheng, Lu, Weihao, Lin, Qian

arXiv.org Artificial Intelligence

Motivated by the studies of neural networks (e.g., the neural tangent kernel theory), we perform a study on the large-dimensional behavior of kernel ridge regression (KRR) where the sample size $n \asymp d^{\gamma}$ for some $\gamma > 0$. Given an RKHS $\mathcal{H}$ associated with an inner product kernel defined on the sphere $\mathbb{S}^{d}$, we suppose that the true function $f_{\rho}^{*} \in [\mathcal{H}]^{s}$, the interpolation space of $\mathcal{H}$ with source condition $s>0$. We first determined the exact order (both upper and lower bounds) of the generalization error of kernel ridge regression for the optimally chosen regularization parameter $\lambda$. We then further showed that when $0 < s \leq 1$, KRR is minimax optimal, and when $s > 1$, KRR is not minimax optimal (a.k.a. the saturation effect). Our results illustrate that the curves of the rate varying along $\gamma$ exhibit the periodic plateau behavior and the multiple descent behavior, and show how these curves evolve with $s>0$. Interestingly, our work provides a unified viewpoint of several recent works on kernel regression in the large-dimensional setting, which correspond to $s=0$ and $s=1$ respectively.
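
To make the role of the regularization parameter concrete, here is a minimal kernel ridge regression sketch, an illustration under assumed choices (the polynomial inner product kernel, the target function, the noise level, and the validation-based grid search are not taken from the paper), that selects $\lambda$ by validation error in the spirit of the optimally chosen $\lambda$ discussed above.

```python
import numpy as np

rng = np.random.default_rng(1)
d, gamma = 20, 1.5
n = int(d ** gamma)                              # sample size n ~ d^gamma

def sphere(m):
    x = rng.standard_normal((m, d))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def kernel(X, Z):
    # assumed inner product kernel on the sphere
    return (1.0 + X @ Z.T) ** 2

f_star = lambda X: np.tanh(2.0 * X[:, 0])        # assumed target function
X, X_val, X_test = sphere(n), sphere(500), sphere(2000)
y = f_star(X) + 0.1 * rng.standard_normal(n)

K, K_val, K_test = kernel(X, X), kernel(X_val, X), kernel(X_test, X)
best_err, lam_opt = np.inf, None
for lam in np.logspace(-6, 1, 20):               # grid over the regularization parameter
    alpha = np.linalg.solve(K + n * lam * np.eye(n), y)
    val_err = np.mean((K_val @ alpha - f_star(X_val)) ** 2)
    if val_err < best_err:
        best_err, lam_opt = val_err, lam

alpha = np.linalg.solve(K + n * lam_opt * np.eye(n), y)
excess_risk = np.mean((K_test @ alpha - f_star(X_test)) ** 2)
print("selected lambda:", lam_opt, " estimated excess risk:", excess_risk)
```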


Optimal Rate of Kernel Regression in Large Dimensions

Lu, Weihao, Zhang, Haobo, Li, Yicheng, Xu, Manyun, Lin, Qian

arXiv.org Machine Learning

We perform a study on kernel regression for large-dimensional data (where the sample size $n$ depends polynomially on the dimension $d$ of the samples, i.e., $n\asymp d^{\gamma}$ for some $\gamma >0$). We first build a general tool to characterize the upper bound and the minimax lower bound of kernel regression for large-dimensional data through the Mendelson complexity $\varepsilon_{n}^{2}$ and the metric entropy $\bar{\varepsilon}_{n}^{2}$, respectively. When the target function falls into the RKHS associated with a (general) inner product kernel defined on $\mathbb{S}^{d}$, we utilize the new tool to show that the minimax rate of the excess risk of kernel regression is $n^{-1/2}$ when $n\asymp d^{\gamma}$ for $\gamma =2, 4, 6, 8, \cdots$. We then further determine the optimal rate of the excess risk of kernel regression for all $\gamma>0$ and find that the curve of the optimal rate varying along $\gamma$ exhibits several new phenomena, including the {\it multiple descent behavior} and the {\it periodic plateau behavior}. As an application, for the neural tangent kernel (NTK), we also provide a similar explicit description of the curve of the optimal rate. As a direct corollary, these claims hold for wide neural networks as well.
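
The statement that the optimal rate is read off as an exponent of $n$ can be made concrete with a small numerical experiment: for a fixed $\gamma$, grow $d$ (hence $n \asymp d^{\gamma}$), fit kernel ridge regression, and regress $\log(\text{excess risk})$ on $\log n$. The kernel, target, noise level, and fixed $\lambda$ below are assumptions for illustration, and the crude slope estimate is only meant to convey how such a rate curve could be traced numerically, not to reproduce the paper's theory.

```python
import numpy as np

rng = np.random.default_rng(2)

def sphere(m, d):
    x = rng.standard_normal((m, d))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def krr_excess_risk(d, gamma, lam=1e-3, n_test=2000):
    """Kernel ridge regression with n ~ d^gamma samples; returns (n, test excess risk)."""
    n = int(d ** gamma)
    X, X_test = sphere(n, d), sphere(n_test, d)
    f = lambda Z: Z[:, 0]                        # assumed simple target on the sphere
    y = f(X) + 0.1 * rng.standard_normal(n)
    K = np.exp(X @ X.T)                          # assumed inner product kernel
    alpha = np.linalg.solve(K + n * lam * np.eye(n), y)
    pred = np.exp(X_test @ X.T) @ alpha
    return n, np.mean((pred - f(X_test)) ** 2)

gamma = 1.5
pairs = [krr_excess_risk(d, gamma) for d in (10, 15, 20, 25, 30)]
log_n = np.log([n for n, _ in pairs])
log_risk = np.log([r for _, r in pairs])
slope = np.polyfit(log_n, log_risk, 1)[0]        # crude empirical rate exponent
print(f"estimated excess-risk exponent at gamma={gamma}: {slope:.2f}")
```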


How rotational invariance of common kernels prevents generalization in high dimensions

Donhauser, Konstantin, Wu, Mingqi, Yang, Fanny

arXiv.org Machine Learning

Kernel ridge regression is well-known to achieve minimax optimal rates in low-dimensional settings. However, its behavior in high dimensions is much less understood. Recent work establishes consistency for kernel regression under certain assumptions on the ground truth function and the distribution of the input data. In this paper, we show that the rotational invariance property of commonly studied kernels (such as RBF, inner product kernels and fully-connected NTK of any depth) induces a bias towards low-degree polynomials in high dimensions. Our result implies a lower bound on the generalization error for a wide range of distributions and various choices of the scaling for kernels with different eigenvalue decays. This lower bound suggests that general consistency results for kernel ridge regression in high dimensions require a more refined analysis that depends on the structure of the kernel beyond its eigenvalue decay.
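
A simple way to probe the low-degree bias described above (an illustrative toy experiment under assumed choices, not the paper's construction) is to fit a rotationally invariant kernel to a target with one low-degree and one higher-degree spherical-harmonic component, both rescaled to comparable size, and check how much of each component the fitted function recovers on fresh data. The Gaussian RBF kernel, dimensions, sample size, and helper names below are assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n, n_test = 30, 500, 2000

def sphere(m):
    x = rng.standard_normal((m, d))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Two (exactly orthogonal) spherical-harmonic components, rescaled to comparable size.
low_deg  = lambda X: np.sqrt(d) * X[:, 0]                       # degree 1
high_deg = lambda X: d ** 1.5 * X[:, 0] * X[:, 1] * X[:, 2]     # degree 3
f = lambda X: low_deg(X) + high_deg(X)

def rbf(X, Z, bw=1.0):
    # rotationally invariant Gaussian (RBF) kernel
    sq_dist = (X ** 2).sum(1)[:, None] + (Z ** 2).sum(1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-np.clip(sq_dist, 0.0, None) / (2.0 * bw ** 2))

X, X_test = sphere(n), sphere(n_test)
y = f(X) + 0.1 * rng.standard_normal(n)
alpha = np.linalg.solve(rbf(X, X) + 1e-3 * np.eye(n), y)
pred = rbf(X_test, X) @ alpha

for name, comp in [("degree-1 part", low_deg(X_test)), ("degree-3 part", high_deg(X_test))]:
    coef = float(np.dot(pred, comp) / np.dot(comp, comp))   # fraction of the component recovered
    print(f"{name}: recovered coefficient ~ {coef:.2f}")
```

With $n$ only polynomial in $d$, the low-degree component is recovered far better than the higher-degree one, which is the qualitative behavior the lower bound above formalizes.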


New Probabilistic Bounds on Eigenvalues and Eigenvectors of Random Kernel Matrices

Reyhani, Nima, Hino, Hideitsu, Vigario, Ricardo

arXiv.org Machine Learning

Kernel methods are successful approaches for different machine learning problems. This success is mainly rooted in the use of feature maps and kernel matrices. Some methods rely on the eigenvalues/eigenvectors of the kernel matrix, while for other methods the spectral information can be used to estimate the excess risk. An important question is how close the sample eigenvalues/eigenvectors are to the population values. In this paper, we improve earlier results on concentration bounds for eigenvalues of general kernel matrices. We also account for, and partially address, the obstacles to sharper bounds. As a case study, we derive a concentration inequality for the sample kernel target-alignment.

1 INTRODUCTION

Kernel methods such as Spectral Clustering, Kernel Principal Component Analysis (KPCA), and Support Vector Machines are successful approaches in many practical machine learning and data analysis problems (Steinwart & Christmann, 2008). The main ingredient of these methods is the kernel matrix, which is built by evaluating the kernel function at the given sample points.
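
As an illustration of the quantities being bounded (a toy sketch under assumed data, kernel, and target choices, not the authors' experiments), one can draw repeated samples, look at the spread of the top eigenvalues of the normalized kernel matrix across draws, and compute the empirical kernel target alignment $A(K, y) = \langle K, yy^{\top}\rangle_F / (\|K\|_F \, \|yy^{\top}\|_F)$.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, n_trials = 300, 10, 20

def rbf_kernel(X):
    sq_dist = (X ** 2).sum(1)[:, None] + (X ** 2).sum(1)[None, :] - 2.0 * X @ X.T
    return np.exp(-np.clip(sq_dist, 0.0, None) / 2.0)

top_eigs, alignments = [], []
for _ in range(n_trials):
    X = rng.standard_normal((n, d))              # fresh sample of n points
    y = np.sign(X[:, 0])                         # assumed binary target
    K = rbf_kernel(X) / n                        # normalized kernel matrix
    top_eigs.append(np.linalg.eigvalsh(K)[-3:][::-1])   # three largest sample eigenvalues
    # kernel target alignment  A(K, y) = <K, y y^T>_F / (||K||_F * ||y y^T||_F)
    alignments.append((y @ K @ y) / (np.linalg.norm(K, "fro") * (y @ y)))

top_eigs = np.array(top_eigs)
print("mean of top-3 eigenvalues over draws:", top_eigs.mean(axis=0))
print("std  of top-3 eigenvalues over draws:", top_eigs.std(axis=0))   # concentration across draws
print("mean sample kernel target-alignment :", float(np.mean(alignments)))
```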