Reviews: Fast and Accurate Stochastic Gradient Estimation

Neural Information Processing Systems 

Summary: This paper develops a new method for adaptively sampling training examples during stochastic optimization. It is known that the optimal distribution that minimizes the nuclear norm of the covariance of the gradient estimate is one where the the probability of sampling an example is proportional to the magnitude of the gradient of the loss on that example. Sampling according to this distribution is of course impractical, because computing this distribution is as expensive as computing the full gradient and requires O(N) time per iteration. To get around this, prior work either maintains a fixed distribution across all iterations or makes strong assumptions on the distribution of gradients of different training examples (e.g.: the gradients of training examples of the same class are similar). This paper proposes a method that can adaptively sample from different distributions every iteration and requires little assumptions on the distribution of gradients, and yet requires the same per-iteration cost as SGD.