Reviews: Reducing the variance in online optimization by transporting past gradients

Neural Information Processing Systems 

This paper proposes a novel gradient estimator that performs a weighted average (similar to momentum) of past and new gradients to estimate the gradient at the current iterate. To motivate their estimator, the authors demonstrate that the SG method with momentum does not decrease the variance unless the momentum parameter is increased like 1 - 1/t. The implicit gradient transport (IGT) estimator is then derived by considering the quadratic case (where the Hessian matrix is the same for all individual functions) with the goal of estimating the "true" online gradient (the simple average over all previously seen gradients). To compensate for the bias of past gradients, the new gradient is evaluated at an extrapolated point rather than at the current iterate. The resulting estimator reduces the variance at an O(1/t) rate, which leads to a theoretical result that may be interpreted as linear convergence, with constant steplength, to a neighborhood that shrinks as O(1/t) for quadratic problems.
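
To make the extrapolation idea concrete, here is a minimal sketch on a one-dimensional quadratic. It assumes the update takes the form g_t = t/(t+1) * g_{t-1} + 1/(t+1) * grad f_t(theta_t + t * (theta_t - theta_{t-1})), which is my reading of the estimator and is not spelled out in the summary above. The function name igt_quadratic_demo, the curvature h, the step size lr, and the Gaussian centers are all hypothetical choices made for illustration. Under these assumptions, the running estimate should coincide, up to rounding, with the plain average of all past sample gradients evaluated at the current iterate.

```python
import random


def igt_quadratic_demo(steps=50, h=2.0, lr=0.1, seed=0):
    """Toy check of the extrapolated-gradient average on f_i(x) = 0.5*h*(x - c_i)^2.

    All names and constants here are illustrative assumptions, not the
    authors' code. Every f_i shares the same curvature h (the "fixed
    Hessian" setting described in the review).
    """
    rng = random.Random(seed)
    theta_prev = theta = 1.0
    g = 0.0          # running gradient estimate
    centers = []     # c_i of every sample seen so far
    max_gap = 0.0    # largest deviation from the exact average gradient

    for t in range(steps):
        c = rng.gauss(0.0, 1.0)
        centers.append(c)

        # Evaluate the new gradient at an extrapolated point, not at theta.
        theta_extra = theta + t * (theta - theta_prev)
        grad_new = h * (theta_extra - c)

        # Weighted average of the past estimate and the new gradient.
        g = (t / (t + 1)) * g + grad_new / (t + 1)

        # Reference: average of all past sample gradients at the current theta.
        avg = sum(h * (theta - ci) for ci in centers) / len(centers)
        max_gap = max(max_gap, abs(g - avg))

        # Plain gradient step with a constant step size.
        theta_prev, theta = theta, theta - lr * g

    return theta, max_gap


if __name__ == "__main__":
    theta, max_gap = igt_quadratic_demo()
    print("final iterate:", theta)
    print("max |estimate - exact average gradient|:", max_gap)
```

The agreement relies on the gradient being affine in theta with a shared curvature, so the extra term t * (theta_t - theta_{t-1}) in the evaluation point exactly cancels the staleness of the past gradients. For non-quadratic objectives this cancellation would hold only approximately.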