
The optimal first-order algorithm of Nesterov achieves linear convergence for such problems, but the number of iterations scales with the square root of the condition number k. The authors consider the setting where one has access to the expensive full gradient of the objective as well as to a cheap stochastic gradient oracle. They propose a hybrid algorithm that requires only O(log(1/eps)) calls to the full gradient oracle (independent of the condition number) and O(k^2 log(1/eps)) calls to the cheaper stochastic gradient oracle; as long as the condition number is not too large, this can be faster in theory.

The main idea behind their algorithm (called Epoch Mixed Gradient Descent, EMGD) is to replace each full gradient step with an epoch consisting of a fixed number, O(k^2), of mixed gradient steps, which combine the full gradient (computed once at the start of the epoch) with stochastic gradients (drawn afresh at every step within the epoch). By averaging the O(k^2) iterates within an epoch, they show a constant-factor decrease of the suboptimality *independent* of the condition number, which is why the number of required full gradient computations (the number of epochs) does not depend on the condition number. They provide a simple and complete self-contained proof of their convergence rate, but no experiments.
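
To make the epoch structure concrete, here is a minimal NumPy sketch of the scheme as the review describes it: one full-gradient call per epoch, a fixed number of cheap steps that mix that full gradient with fresh stochastic gradients, and averaging of the within-epoch iterates to form the next epoch center. The particular mixing rule, step size, and names (emgd_sketch, full_grad, stoch_grad, sample) are illustrative assumptions; the paper's exact update and parameter choices are not reproduced here.

```python
import numpy as np

def emgd_sketch(full_grad, stoch_grad, sample, w0,
                n_epochs=20, steps_per_epoch=200, eta=0.05):
    """Illustrative sketch of the epoch-mixed-gradient idea described above.

    full_grad(w)     -> exact gradient (expensive oracle, called once per epoch)
    stoch_grad(w, i) -> gradient of the i-th component (cheap oracle)
    sample()         -> draws a random component index i
    The mixing rule and plain iterate averaging below are assumptions made
    for illustration only, not the paper's exact algorithm.
    """
    center = np.asarray(w0, dtype=float)
    for _ in range(n_epochs):
        g_full = full_grad(center)            # one expensive call per epoch
        w = center.copy()
        running_sum = np.zeros_like(center)
        for _ in range(steps_per_epoch):      # O(k^2) cheap calls per epoch
            i = sample()
            # mixed gradient: the epoch's full gradient corrected by fresh
            # stochastic information at the current iterate
            g_mixed = g_full + stoch_grad(w, i) - stoch_grad(center, i)
            w = w - eta * g_mixed
            running_sum += w
        center = running_sum / steps_per_epoch  # average the epoch's iterates
    return center


# Toy usage: regularized least squares, F(w) = (1/2n)||Xw - y||^2 + (lam/2)||w||^2
rng = np.random.default_rng(0)
n, d, lam = 500, 10, 0.1
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d)

full_grad = lambda w: X.T @ (X @ w - y) / n + lam * w
stoch_grad = lambda w, i: X[i] * (X[i] @ w - y[i]) + lam * w
w_hat = emgd_sketch(full_grad, stoch_grad, lambda: rng.integers(n), np.zeros(d))
```

The point of the sketch is only to show where the two oracles enter: the expensive full gradient appears once per epoch (so the number of such calls equals the number of epochs), while the per-step cost inside an epoch involves only the cheap stochastic oracle.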