Musco, Cameron, Musco, Christopher

We give the first algorithm for kernel Nystrom approximation that runs in linear time in the number of training points and is provably accurate for all kernel matrices, without dependence on regularity or incoherence conditions. The algorithm projects the kernel onto a set of s landmark points sampled by their ridge leverage scores, requiring just O(ns) kernel evaluations and O(ns^2) additional runtime. While leverage score sampling has long been known to give strong theoretical guarantees for Nystrom approximation, by employing a fast recursive sampling scheme, our algorithm is the first to make the approach scalable. Empirically we show that it finds more accurate kernel approximations in less time than popular techniques such as classic Nystrom approximation and the random Fourier features method.

Musco, Cameron, Musco, Christopher

We give the first algorithm for kernel Nystr\"om approximation that runs in *linear time in the number of training points* and is provably accurate for all kernel matrices, without dependence on regularity or incoherence conditions. The algorithm projects the kernel onto a set of $s$ landmark points sampled by their *ridge leverage scores*, requiring just $O(ns)$ kernel evaluations and $O(ns^2)$ additional runtime. While leverage score sampling has long been known to give strong theoretical guarantees for Nystr\"om approximation, by employing a fast recursive sampling scheme, our algorithm is the first to make the approach scalable. Empirically we show that it finds more accurate, lower rank kernel approximations in less time than popular techniques such as uniformly sampled Nystr\"om approximation and the random Fourier features method.

Alaoui, Ahmed El, Mahoney, Michael W.

One approach to improving the running time of kernel-based machine learning methods is to build a small sketch of the input and use it in lieu of the full kernel matrix in the machine learning task of interest. Here, we describe a version of this approach that comes with running time guarantees as well as improved guarantees on its statistical performance. By extending the notion of \emph{statistical leverage scores} to the setting of kernel ridge regression, our main statistical result is to identify an importance sampling distribution that reduces the size of the sketch (i.e., the required number of columns to be sampled) to the \emph{effective dimensionality} of the problem. This quantity is often much smaller than previous bounds that depend on the \emph{maximal degrees of freedom}. Our main algorithmic result is to present a fast algorithm to compute approximations to these scores. This algorithm runs in time that is linear in the number of samples---more precisely, the running time is $O(np^2)$, where the parameter $p$ depends only on the trace of the kernel matrix and the regularization parameter---and it can be applied to the matrix of feature vectors, without having to form the full kernel matrix. This is obtained via a variant of length-squared sampling that we adapt to the kernel setting in a way that is of independent interest. Lastly, we provide empirical results illustrating our theory, and we discuss how this new notion of the statistical leverage of a data point captures in a fine way the difficulty of the original statistical learning problem.

Musco, Cameron, Musco, Christopher

We give the first algorithm for kernel Nystrom approximation that runs in linear time in the number of training points and is provably accurate for all kernel matrices, without dependence on regularity or incoherence conditions. The algorithm projects the kernel onto a set of s landmark points sampled by their ridge leverage scores, requiring just O(ns) kernel evaluations and O(ns 2) additional runtime. While leverage score sampling has long been known to give strong theoretical guarantees for Nystrom approximation, by employing a fast recursive sampling scheme, our algorithm is the first to make the approach scalable. Empirically we show that it finds more accurate kernel approximations in less time than popular techniques such as classic Nystrom approximation and the random Fourier features method. Papers published at the Neural Information Processing Systems Conference.

Li, Zhu, Ton, Jean-Francois, Oglic, Dino, Sejdinovic, Dino

We provide the first unified theoretical analysis of supervised learning with random Fourier features, covering different types of loss functions characteristic to kernel methods developed for this setting. More specifically, we investigate learning with squared error and Lipschitz continuous loss functions and give the sharpest expected risk convergence rates for problems in which random Fourier features are sampled either using the spectral measure corresponding to a shift-invariant kernel or the ridge leverage score function proposed in~\cite{avron2017random}. The trade-off between the number of features and the expected risk convergence rate is expressed in terms of the regularization parameter and the effective dimension of the problem. While the former can effectively capture the complexity of the target hypothesis, the latter is known for expressing the fine structure of the kernel with respect to the marginal distribution of a data generating process~\cite{caponnetto2007optimal}. In addition to our theoretical results, we propose an approximate leverage score sampler for large scale problems and show that it can be significantly more effective than the spectral measure sampler.