nonconvex case
Convergence of SGD with momentum in the nonconvex case: A time window-based analysis
Qiu, Junwen, Ma, Bohao, Milzarek, Andre
We propose a novel time window-based analysis technique to investigate the convergence properties of the stochastic gradient descent method with momentum (SGDM) in nonconvex settings. Despite its popularity, the convergence behavior of SGDM remains less understood in nonconvex scenarios. This is primarily due to the absence of a sufficient descent property and challenges in simultaneously controlling the momentum and stochastic errors in an almost sure sense. To address these challenges, we investigate the behavior of SGDM over specific time windows, rather than examining the descent of consecutive iterates as in traditional studies. This time window-based approach simplifies the convergence analysis and enables us to establish the first iterate convergence result for SGDM under the Kurdyka-Lojasiewicz (KL) property. We further provide local convergence rates which depend on the underlying KL exponent and the utilized step size schemes.
A Simple Proximal Stochastic Gradient Method for Nonsmooth Nonconvex Optimization
We analyze stochastic gradient algorithms for optimizing nonconvex, nonsmooth finite-sum problems. In particular, the objective function is given by the summation of a differentiable (possibly nonconvex) component, together with a possibly non-differentiable but convex component. We propose a proximal stochastic gradient algorithm based on variance reduction, called ProxSVRG+. Our main contribution lies in the analysis of ProxSVRG+. It recovers several existing convergence results and improves/generalizes them (in terms of the number of stochastic gradient oracle calls and proximal oracle calls). In particular, ProxSVRG+ generalizes the best results given by the SCSG algorithm, recently proposed by [Lei et al., NIPS'17] for the smooth nonconvex case. ProxSVRG+ is also more straightforward than SCSG and yields simpler analysis. Moreover, ProxSVRG+ outperforms the deterministic proximal gradient descent (ProxGD) for a wide range of minibatch sizes, which partially solves an open problem proposed in [Reddi et al., NIPS'16]. Also, ProxSVRG+ uses much less proximal oracle calls than ProxSVRG [Reddi et al., NIPS'16]. Moreover, for nonconvex functions satisfied Polyak-\L{}ojasiewicz condition, we prove that ProxSVRG+ achieves a global linear convergence rate without restart unlike ProxSVRG. Thus, it can \emph{automatically} switch to the faster linear convergence in some regions as long as the objective function satisfies the PL condition locally in these regions. Finally, we conduct several experiments and the experimental results are consistent with the theoretical results.
A Simple Proximal Stochastic Gradient Method for Nonsmooth Nonconvex Optimization
We analyze stochastic gradient algorithms for optimizing nonconvex, nonsmooth finite-sum problems. In particular, the objective function is given by the summation of a differentiable (possibly nonconvex) component, together with a possibly non-differentiable but convex component. We propose a proximal stochastic gradient algorithm based on variance reduction, called ProxSVRG+. Our main contribution lies in the analysis of ProxSVRG+. It recovers several existing convergence results and improves/generalizes them (in terms of the number of stochastic gradient oracle calls and proximal oracle calls). In particular, ProxSVRG+ generalizes the best results given by the SCSG algorithm, recently proposed by [Lei et al., NIPS'17] for the smooth nonconvex case. ProxSVRG+ is also more straightforward than SCSG and yields simpler analysis. Moreover, ProxSVRG+ outperforms the deterministic proximal gradient descent (ProxGD) for a wide range of minibatch sizes, which partially solves an open problem proposed in [Reddi et al., NIPS'16]. Also, ProxSVRG+ uses much less proximal oracle calls than ProxSVRG [Reddi et al., NIPS'16]. Moreover, for nonconvex functions satisfied Polyak-\L{}ojasiewicz condition, we prove that ProxSVRG+ achieves a global linear convergence rate without restart unlike ProxSVRG. Thus, it can \emph{automatically} switch to the faster linear convergence in some regions as long as the objective function satisfies the PL condition locally in these regions. Finally, we conduct several experiments and the experimental results are consistent with the theoretical results.
Generalization Error Bounds for Optimization Algorithms via Stability
Meng, Qi (Peking University) | Wang, Yue (Beijing Jiaotong University) | Chen, Wei (Microsoft Research) | Wang, Taifeng (Microsoft Research) | Ma, Zhi-Ming (Chinese Academy of Mathematics and Systems Science) | Liu, Tie-Yan (Microsoft Research)
Many machine learning tasks can be formulated as Regularized Empirical Risk Minimization (R-ERM), and solved by optimization algorithms such as gradient descent (GD), stochastic gradient descent (SGD), and stochastic variance reduction (SVRG). Conventional analysis on these optimization algorithms focuses on their convergence rates during the training process, however, people in the machine learning community may care more about the generalization performance of the learned model on unseen test data. In this paper, we investigate on this issue, by using stability as a tool. In particular, we decompose the generalization error for R-ERM, and derive its upper bound for both convex and nonconvex cases. In convex cases, we prove that the generalization error can be bounded by the convergence rate of the optimization algorithm and the stability of the R-ERM process, both in expectation (in the order of 𝒪(1/ n )+ 𝔼ρ( T )), where ρ( T ) is the convergence error and T is the number of iterations) and in high probability (in the order of 𝒪(log{1/δ / √ n + ρ( T ) with probability 1 – δ). For nonconvex cases, we can also obtain a similar expected generalization error bound. Our theorems indicate that 1) along with the training process, the generalization error will decrease for all the optimization algorithms under our investigation; 2) Comparatively speaking, SVRG has better generalization ability than GD and SGD. We have conducted experiments on both convex and nonconvex problems, and the experimental results verify our theoretical findings.
Generalization Error Bounds for Optimization Algorithms via Stability
Meng, Qi, Wang, Yue, Chen, Wei, Wang, Taifeng, Ma, Zhi-Ming, Liu, Tie-Yan
Many machine learning tasks can be formulated as Regularized Empirical Risk Minimization (R-ERM), and solved by optimization algorithms such as gradient descent (GD), stochastic gradient descent (SGD), and stochastic variance reduction (SVRG). Conventional analysis on these optimization algorithms focuses on their convergence rates during the training process, however, people in the machine learning community may care more about the generalization performance of the learned model on unseen test data. In this paper, we investigate on this issue, by using stability as a tool. In particular, we decompose the generalization error for R-ERM, and derive its upper bound for both convex and non-convex cases. In convex cases, we prove that the generalization error can be bounded by the convergence rate of the optimization algorithm and the stability of the R-ERM process, both in expectation (in the order of $\mathcal{O}((1/n)+\mathbb{E}\rho(T))$, where $\rho(T)$ is the convergence error and $T$ is the number of iterations) and in high probability (in the order of $\mathcal{O}\left(\frac{\log{1/\delta}}{\sqrt{n}}+\rho(T)\right)$ with probability $1-\delta$). For non-convex cases, we can also obtain a similar expected generalization error bound. Our theorems indicate that 1) along with the training process, the generalization error will decrease for all the optimization algorithms under our investigation; 2) Comparatively speaking, SVRG has better generalization ability than GD and SGD. We have conducted experiments on both convex and non-convex problems, and the experimental results verify our theoretical findings.