Optimal tuning for divide-and-conquer kernel ridge regression with massive data

arXiv.org Machine Learning

We propose the first data-driven tuning procedure for divide-and-conquer kernel ridge regression (Zhang et al., 2015). While the proposed criterion is computationally scalable for massive data sets, it is also shown to be asymptotically optimal under mild conditions. The effectiveness of our method is illustrated by extensive simulations and an application to the Million Song Dataset.
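Divide-and-conquer kernel ridge regression in the sense of Zhang et al. (2015) splits the n samples into disjoint blocks, solves one kernel ridge regression problem per block, and averages the resulting predictors. Below is a minimal NumPy sketch of that baseline; the RBF kernel, the fixed regularization parameter lam, and the block count m are illustrative placeholders, and the data-driven criterion for choosing lam proposed in the abstract is not reproduced here.

    import numpy as np

    def rbf_kernel(A, B, gamma=1.0):
        # Gaussian (RBF) kernel matrix from pairwise squared distances.
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * sq)

    def dc_krr_fit(X, y, m=10, lam=1e-3, gamma=1.0, seed=0):
        # Split the data into m disjoint blocks and solve one KRR problem per block.
        rng = np.random.default_rng(seed)
        blocks = np.array_split(rng.permutation(len(X)), m)
        models = []
        for ix in blocks:
            Xb, yb = X[ix], y[ix]
            K = rbf_kernel(Xb, Xb, gamma)
            alpha = np.linalg.solve(K + lam * len(Xb) * np.eye(len(Xb)), yb)
            models.append((Xb, alpha))
        return models

    def dc_krr_predict(models, Xnew, gamma=1.0):
        # The divide-and-conquer estimator averages the local KRR predictors.
        preds = [rbf_kernel(Xnew, Xb, gamma) @ alpha for Xb, alpha in models]
        return np.mean(preds, axis=0)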


Truthful Linear Regression

arXiv.org Machine Learning

We consider the problem of fitting a linear model to data held by individuals who are concerned about their privacy. Incentivizing most players to truthfully report their data to the analyst constrains our design to mechanisms that provide a privacy guarantee to the participants; we use differential privacy to model individuals' privacy losses. This immediately poses a problem, as differentially private computation of a linear model necessarily produces a biased estimate, and existing approaches to designing mechanisms that elicit data from privacy-sensitive individuals do not generalize well to biased estimators. We overcome this challenge through an appropriate design of the computation and payment scheme.
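The bias mentioned above comes from the noise that any differentially private estimator must inject. As a hedged illustration only (not the paper's mechanism, which couples the private computation with a payment scheme), the sketch below perturbs the sufficient statistics X^T X and X^T y with Gaussian noise before solving a ridge system; calibrating the noise scale sigma to the data's sensitivity and the privacy budget is required for an actual guarantee and is omitted here.

    import numpy as np

    def dp_linear_regression(X, y, sigma=1.0, lam=1.0, seed=0):
        # Illustrative sufficient-statistics perturbation: add symmetric Gaussian
        # noise to X^T X and Gaussian noise to X^T y, then solve a ridge system.
        rng = np.random.default_rng(seed)
        d = X.shape[1]
        E = rng.normal(scale=sigma, size=(d, d))
        noisy_xtx = X.T @ X + (E + E.T) / 2.0
        noisy_xty = X.T @ y + rng.normal(scale=sigma, size=d)
        # lam keeps the perturbed system well conditioned; the injected noise is
        # what makes the resulting estimate biased relative to least squares.
        return np.linalg.solve(noisy_xtx + lam * np.eye(d), noisy_xty)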


Implicit ridge regularization provided by the minimum-norm least squares estimator when $n\ll p$

arXiv.org Machine Learning

Conventional wisdom in statistical learning holds that large models require strong regularization to prevent overfitting. This rule has recently been challenged by deep neural networks: despite being expressive enough to fit any training set perfectly, they still generalize well. Here we show that the same is true for linear regression in the under-determined $n\ll p$ situation, provided that one uses the minimum-norm estimator. The case of a linear model with least-squares loss allows a full and exact mathematical analysis. We prove that augmenting a model with many random covariates of small constant variance and using the minimum-norm estimator is asymptotically equivalent to adding a ridge penalty. Using toy simulations as well as real-life high-dimensional data sets, we demonstrate that an explicit ridge penalty often fails to provide any improvement over this implicit ridge regularization. In this regime, the minimum-norm estimator achieves zero training error but nevertheless has low expected error.
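The equivalence claimed in the abstract can be checked numerically: padding the design matrix with q random covariates of variance v and taking the minimum-norm least-squares solution behaves, for large q, like ridge regression on the original covariates with penalty roughly q*v. A minimal simulation sketch, with all sizes, variances, and the noise level chosen purely for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    n, p, q, v = 50, 20, 2000, 0.01        # augmented width p + q far exceeds n
    X = rng.normal(size=(n, p))
    beta = rng.normal(size=p)
    y = X @ beta + rng.normal(scale=0.5, size=n)

    # Minimum-norm least-squares solution on the augmented design [X, Z]:
    Z = rng.normal(scale=np.sqrt(v), size=(n, q))
    Xa = np.hstack([X, Z])
    beta_minnorm = np.linalg.pinv(Xa) @ y  # interpolates the training data

    # Explicit ridge on the original design with penalty lambda = q * v,
    # the value suggested by the asymptotic equivalence:
    lam = q * v
    beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

    # The first p coordinates of the min-norm solution approximate the ridge
    # solution; the gap shrinks as q grows.
    print(np.linalg.norm(beta_minnorm[:p] - beta_ridge))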


Singular ridge regression with homoscedastic residuals: generalization error with estimated parameters

arXiv.org Machine Learning

This paper characterizes the conditional distribution properties of the finite-sample ridge regression estimator and uses that result to evaluate the total regression and generalization errors that incorporate the inaccuracies committed at the time of parameter estimation. The paper provides explicit formulas for those errors. Unlike other classical references in this setting, our results hold in a fully singular setup that does not assume the existence of a solution for the non-regularized regression problem. In exchange, we invoke a conditional homoscedasticity hypothesis on the regularized regression residuals that is crucial to our developments.
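For context, the singular setup means that the ordinary least-squares normal equations need not have a unique solution, while the ridge estimator remains well defined for any positive penalty. A small illustrative sketch of that setting (not the paper's error formulas, which the abstract does not reproduce):

    import numpy as np

    rng = np.random.default_rng(1)
    n, p, lam = 30, 100, 0.5               # p > n, so X^T X is singular
    X = rng.normal(size=(n, p))
    y = rng.normal(size=n)

    # The unregularized normal equations have no unique solution here, but the
    # ridge estimator exists for every lam > 0:
    beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    residuals = y - X @ beta_ridge         # regularized regression residuals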


A Unified Analysis of Random Fourier Features

arXiv.org Machine Learning

We provide the first unified theoretical analysis of supervised learning with random Fourier features, covering different types of loss functions characteristic of kernel methods developed for this setting. More specifically, we investigate learning with squared-error and Lipschitz-continuous loss functions and give the sharpest expected risk convergence rates for problems in which random Fourier features are sampled either from the spectral measure corresponding to a shift-invariant kernel or from the ridge leverage score function proposed by Avron et al. (2017). The trade-off between the number of features and the expected risk convergence rate is expressed in terms of the regularization parameter and the effective dimension of the problem. While the former can effectively capture the complexity of the target hypothesis, the latter is known to express the fine structure of the kernel with respect to the marginal distribution of the data-generating process (Caponnetto and De Vito, 2007). In addition to our theoretical results, we propose an approximate leverage score sampler for large-scale problems and show that it can be significantly more effective than the spectral measure sampler.
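As a concrete reference point for the spectral-measure sampler discussed above, the sketch below draws random Fourier features for the Gaussian kernel and fits ridge regression in the feature space; the kernel bandwidth sigma, feature count D, and penalty lam are illustrative, and the approximate leverage score sampler proposed in the paper is not shown.

    import numpy as np

    def rff_ridge_fit(X, y, D=500, sigma=1.0, lam=1e-2, seed=0):
        # Sample frequencies from the spectral measure of the Gaussian kernel
        # exp(-||x - y||^2 / (2 sigma^2)), i.e. N(0, sigma^{-2} I), plus random
        # phases, then solve linear ridge regression in the random feature space.
        rng = np.random.default_rng(seed)
        W = rng.normal(scale=1.0 / sigma, size=(X.shape[1], D))
        b = rng.uniform(0.0, 2.0 * np.pi, size=D)
        Z = np.sqrt(2.0 / D) * np.cos(X @ W + b)
        w = np.linalg.solve(Z.T @ Z + lam * np.eye(D), Z.T @ y)
        return W, b, w

    def rff_ridge_predict(Xnew, W, b, w):
        # Map new points with the same frequencies and phases, then apply weights.
        Znew = np.sqrt(2.0 / W.shape[1]) * np.cos(Xnew @ W + b)
        return Znew @ w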