Accelerated Doubly Stochastic Gradient Algorithm for Large-scale Empirical Risk Minimization

Zebang Shen, Hui Qian, Tongzhou Mu, Chao Zhang

arXiv.org Artificial Intelligence 

We consider the composite problem min_x F(x) + P(x), where F(x) = (1/n) * sum_{i=1}^{n} f_i(x) is the average of the convex component functions f_i, and P(x) is a block-separable and convex regularization function. We allow both F(x) and P(x) to be non-smooth. In machine learning applications, many tasks can be naturally phrased as the above problem, e.g., the Empirical Risk Minimization (ERM) problem Zhang and Xiao [2015], Friedman et al. [2001]. However, the recent explosive growth of data introduces unprecedented computational challenges in scalability and storage. Consequently, algorithms with fast convergence, small memory footprints, and low per-iteration complexity have been actively pursued in recent years. Most successful approaches adopt stochastic strategies that incorporate randomness into the solving procedure. These approaches follow two parallel tracks: in each iteration, the gradient is estimated from either a mini-batch of samples or a small block of variable coordinates, typically selected by a random procedure. Accessing only a random mini-batch of samples in each iteration, classical Stochastic Gradient Descent (SGD) and its Variance Reduction (VR) variants, such as SVRG Johnson and Zhang [2013], SAGA Defazio et al. [2014], and SAG Schmidt et al. [2013], have gained increasing attention over the past five years.
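To make the two stochastic tracks concrete, the following is a minimal sketch of a generic doubly stochastic proximal-gradient update on an L1-regularized least-squares ERM instance: the gradient is estimated from a random mini-batch of samples, only a random block of coordinates is updated, and the block-separable regularizer is handled by its proximal (soft-thresholding) operator. This is an illustrative assumption, not the accelerated algorithm proposed in the paper; the function names, step size, and synthetic data are hypothetical.

```python
# Hypothetical sketch: a doubly stochastic proximal-gradient step for
# min_x (1/(2n)) * ||A x - b||^2 + lam * ||x||_1.
import numpy as np

def soft_threshold(v, t):
    # Proximal operator of t * ||.||_1 (the block-separable regularizer P).
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def doubly_stochastic_prox_sgd(A, b, lam=0.1, step=0.05,
                               batch_size=8, block_size=16,
                               iters=2000, seed=0):
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(iters):
        # Track 1: random mini-batch of samples for the gradient estimate.
        idx = rng.choice(n, size=batch_size, replace=False)
        residual = A[idx] @ x - b[idx]
        # Track 2: random block of coordinates to update.
        blk = rng.choice(d, size=min(block_size, d), replace=False)
        grad_blk = A[idx][:, blk].T @ residual / batch_size
        # Proximal gradient step restricted to the sampled block.
        x[blk] = soft_threshold(x[blk] - step * grad_blk, step * lam)
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.standard_normal((500, 100))
    x_true = np.zeros(100)
    x_true[:5] = 1.0
    b = A @ x_true + 0.01 * rng.standard_normal(500)
    x_hat = doubly_stochastic_prox_sgd(A, b)
    print("recovered support:", np.flatnonzero(np.abs(x_hat) > 0.1))
```

Per iteration this sketch touches only batch_size rows and block_size coordinates of A, which is what keeps the per-iteration cost and memory footprint low for large n and d.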
