Direct Acceleration of SAGA using Sampled Negative Momentum
We focus on achieving very high accuracy for Problem (1), although for practical optimization tasks, such as supervised learning, low empirical risk may result in high generalization error. In this paper, we treat Problem (1) as a pure optimization problem. When F(·) in Problem (1) is strongly convex, traditional analysis shows that gradient descent (GD) yields a fast linear convergence rate but with a high per-iteration cost, and thus may not be suitable for problems with a very large n. As an alternative for large problems, SGD [Robbins and Monro, 1951] uses only one or a mini-batch of gradients in each iteration, and thus enjoys significantly lower per-iteration complexity than GD. However, due to the variance of gradient estimator, vanilla SGD is shown to yield only a sub-linear convergence rate. Recently, stochastic variance reduced methods (e.g., SAG [Roux et al., 2012], SVRG [Johnson and Zhang, 2013], SAGA [Defazio et al., 2014], and their proximal variants, such as [Schmidt et al., 2017], [Xiao and Zhang, 2014] and [Konečný et al., 2016]) were proposed to solve Problem (1). All these methods are equipped with various variance reduction techniques, which help them achieve low per-iteration complexities comparable with SGD and at the same time maintain the fast linear convergence rate of GD.
Jun-28-2018
- Country:
- Asia
- Middle East > Jordan (0.04)
- China > Hong Kong (0.04)
- Asia
- Genre:
- Research Report (0.65)
- Technology: