prox-svrg
a376033f78e144f494bfc743c0be3330-Supplemental.pdf
Inthis section, we provide theoretical analysis ofHSPG. Moreover, we further point out that: (1) theSub-gradient Descent Stepwe used to achieve a "close enough" solution canbereplaced byothermethods, and(2)theAssumption 4isonlyasufficientcondition thatwecouldusetoshowthe"closeenough"condition. B.1 RelatedWork Problem (12)has been well studied indeterministic optimization with various algorithms that are capable ofreturning solutions with both lowobjectivevalueandhigh group sparsity under proper ฮป(95;73;42;64). For example, proximal stochastic variance-reduced gradient method (Prox-SVRG)(88)and proximal spider (Prox-Spider) (97) are developed to adopt multi-stage schemes based on the well-known variance reduction technique SVRG proposed in (46) and Spider developed in (22) respectively. Under Assumption 1, the search directiondk is a descent direction forฯBk(xk), i.e., d>k ฯBk(xk)<0.
Stochastic Proximal Gradient Descent with Acceleration Techniques
Proximal gradient descent (PGD) and stochastic proximal gradient descent (SPGD) are popular methods for solving regularized risk minimization problems in machine learning and statistics. In this paper, we propose and analyze an accelerated variant of these methods in the mini-batch setting. This method incorporates two acceleration techniques: one is Nesterov's acceleration method, and the other is a variance reduction for the stochastic gradient. Accelerated proximal gradient descent (APG) and proximal stochastic variance reduction gradient (Prox-SVRG) are in a trade-off relationship. We show that our method, with the appropriate mini-batch size, achieves lower overall complexity than both APG and Prox-SVRG.
Orthant Based Proximal Stochastic Gradient Method for $\ell_1$-Regularized Optimization
Chen, Tianyi, Ding, Tianyu, Ji, Bo, Wang, Guanyi, Shi, Yixin, Yi, Sheng, Tu, Xiao, Zhu, Zhihui
Sparsity-inducing regularization problems are ubiquitous in machine learning applications, ranging from feature selection to model compression. In this paper, we present a novel stochastic method -- Orthant Based Proximal Stochastic Gradient Method (OBProx-SG) -- to solve perhaps the most popular instance, i.e., the l1-regularized problem. The OBProx-SG method contains two steps: (i) a proximal stochastic gradient step to predict a support cover of the solution; and (ii) an orthant step to aggressively enhance the sparsity level via orthant face projection. Compared to the state-of-the-art methods, e.g., Prox-SG, RDA and Prox-SVRG, the OBProx-SG not only converges to the global optimal solutions (in convex scenario) or the stationary points (in non-convex scenario), but also promotes the sparsity of the solutions substantially. Particularly, on a large number of convex problems, OBProx-SG outperforms the existing methods comprehensively in the aspect of sparsity exploration and objective values. Moreover, the experiments on non-convex deep neural networks, e.g., MobileNetV1 and ResNet18, further demonstrate its superiority by achieving the solutions of much higher sparsity without sacrificing generalization accuracy.
Stochastic Proximal Gradient Descent with Acceleration Techniques
Proximal gradient descent (PGD) and stochastic proximal gradient descent (SPGD) are popular methods for solving regularized risk minimization problems in machine learning and statistics. In this paper, we propose and analyze an accelerated variant of these methods in the mini-batch setting. This method incorporates two acceleration techniques: one is Nesterov's acceleration method, and the other is a variance reduction for the stochastic gradient. Accelerated proximal gradient descent (APG) and proximal stochastic variance reduction gradient (Prox-SVRG) are in a trade-off relationship. We show that our method, with the appropriate mini-batch size, achieves lower overall complexity than both APG and Prox-SVRG.
Breaking the Span Assumption Yields Fast Finite-Sum Minimization
Hannah, Robert, Liu, Yanli, O', Connor, Daniel, Yin, Wotao
In this paper, we show that SVRG and SARAH can be modified to be fundamentally faster than all of the other standard algorithms that minimize the sum of $n$ smooth functions, such as SAGA, SAG, SDCA, and SDCA without duality. Most finite sum algorithms follow what we call the ``span assumption'': Their updates are in the span of a sequence of component gradients chosen in a random IID fashion. In the big data regime, where the condition number $\kappa=O(n)$, the span assumption prevents algorithms from converging to an approximate solution of accuracy $\epsilon$ in less than $n\ln(1/\epsilon)$ iterations. SVRG and SARAH do not follow the span assumption since they are updated with a hybrid of full-gradient and component-gradient information. We show that because of this, they can be up to $\Omega(1+(\ln(n/\kappa))_+)$ times faster. In particular, to obtain an accuracy $\epsilon = 1/n^\alpha$ for $\kappa=n^\beta$ and $\alpha,\beta\in(0,1)$, modified SVRG requires $O(n)$ iterations, whereas algorithms that follow the span assumption require $O(n\ln(n))$ iterations. Moreover, we present lower bound results that show this speedup is optimal, and provide analysis to help explain why this speedup exists. With the understanding that the span assumption is a point of weakness of finite sum algorithms, future work may purposefully exploit this to yield faster algorithms in the big data regime.
Breaking the Span Assumption Yields Fast Finite-Sum Minimization
Hannah, Robert, Liu, Yanli, O', Connor, Daniel, Yin, Wotao
In this paper, we show that SVRG and SARAH can be modified to be fundamentally faster than all of the other standard algorithms that minimize the sum of $n$ smooth functions, such as SAGA, SAG, SDCA, and SDCA without duality. Most finite sum algorithms follow what we call the ``span assumption'': Their updates are in the span of a sequence of component gradients chosen in a random IID fashion. In the big data regime, where the condition number $\kappa=O(n)$, the span assumption prevents algorithms from converging to an approximate solution of accuracy $\epsilon$ in less than $n\ln(1/\epsilon)$ iterations. SVRG and SARAH do not follow the span assumption since they are updated with a hybrid of full-gradient and component-gradient information. We show that because of this, they can be up to $\Omega(1+(\ln(n/\kappa))_+)$ times faster. In particular, to obtain an accuracy $\epsilon = 1/n^\alpha$ for $\kappa=n^\beta$ and $\alpha,\beta\in(0,1)$, modified SVRG requires $O(n)$ iterations, whereas algorithms that follow the span assumption require $O(n\ln(n))$ iterations. Moreover, we present lower bound results that show this speedup is optimal, and provide analysis to help explain why this speedup exists. With the understanding that the span assumption is a point of weakness of finite sum algorithms, future work may purposefully exploit this to yield faster algorithms in the big data regime.
A New Analysis of Variance Reduced Stochastic Proximal Methods for Composite Optimization with Serial and Asynchronous Realizations
We provide a comprehensive analysis of stochastic variance reduced gradient (SVRG) based proximal algorithms, both with and without momentum, in serial and asynchronous realizations. Specifically, we propose the Prox-SVRG$^{++}$ algorithm, and prove that it has a linear convergence rate with a small epoch length and we obtain an $O(1/\epsilon)$ complexity in non-strongly convex case. Then, we propose a momentum accelerated algorithm, called Prox-MSVRG$^{++}$, and show that it achieves a complexity of $O(1/\sqrt{\epsilon})$. After that, we develop two asynchronous versions based on the above serial algorithms and provide a general analysis under nonconvex and non-strongly convex cases respectively. Our theoretical results indicate that the algorithms can achieve a potentially significant speedup when implemented with multiple servers. We conduct extensive experiments based on $4$ real-world datasets on an experiment platform with $11$ physical machines. The experiments validate our theoretical findings and demonstrate the effectiveness of our algorithms.