Reviews: Learning with SGD and Random Features

Neural Information Processing Systems 

I have updated my score to an 8 accordingly. I think the planned updates to the empirical section will add a lot of value. Summary: This paper analyzes the generalization performance of models trained using mini-batch stochastic gradient methods with random features (e.g., for kernel approximation), for regression tasks using the least squares loss. The main theorem (Theorem 1) bounds the gap in generalization performance between the lowest-risk model in the RKHS and the SGD-trained model after t mini-batch updates; the bound is in terms of the learning rate, the mini-batch size, the number of random features M, the training set size n, and t. The authors show that under certain choices of these parameters, M = O(sqrt(n)) features are sufficient to guarantee that the gap is O(1/sqrt(n)).
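For concreteness, the setting summarized above can be sketched roughly as follows. This is only an illustration under my own assumptions, not the authors' implementation: it assumes a Gaussian kernel approximated by random Fourier features, a constant step size, mini-batches sampled with replacement, and it returns the last iterate. The names `random_fourier_features`, `sgd_random_features`, `gamma`, `step_size`, and `batch_size` are hypothetical.

```python
import numpy as np


def random_fourier_features(X, W, b):
    """Map inputs X (n, d) to M random Fourier features, shape (n, M)."""
    M = W.shape[1]
    return np.sqrt(2.0 / M) * np.cos(X @ W + b)


def sgd_random_features(X, y, M, step_size, batch_size, n_steps,
                        gamma=1.0, seed=0):
    """Mini-batch SGD on the least squares loss over M random features.

    Assumption: the features approximate the Gaussian kernel
    exp(-gamma * ||x - x'||^2), so frequencies are drawn from N(0, 2*gamma*I).
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, M))
    b = rng.uniform(0.0, 2.0 * np.pi, size=M)
    Z = random_fourier_features(X, W, b)

    w = np.zeros(M)
    for _ in range(n_steps):
        # Sample a mini-batch with replacement (an assumption of this sketch).
        idx = rng.integers(0, n, size=batch_size)
        residual = Z[idx] @ w - y[idx]
        # Gradient of the mini-batch loss (1 / (2 * batch_size)) * ||residual||^2.
        grad = Z[idx].T @ residual / batch_size
        w -= step_size * grad
    return w, W, b
```

A possible usage with n = 400 points would take `M = int(np.sqrt(n))` features, in line with the M = O(sqrt(n)) regime discussed in the summary, and then predict with `random_fourier_features(X_new, W, b) @ w`. The step size, mini-batch size, and number of steps are left as free arguments because the theorem's guarantee depends on how they are chosen jointly.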