First-order Methods Almost Always Avoid Saddle Points

Lee, Jason D., Panageas, Ioannis, Piliouras, Georgios, Simchowitz, Max, Jordan, Michael I., Recht, Benjamin

arXiv.org Machine Learning 

Saddle points have long been regarded as a major obstacle for non-convex optimization over continuous spaces. It is well understood that in many applications of interest, the number of saddle points significantly outnumber the number of local minima, which is especially problematic when the solutions associated with worst-case saddle points are considerably worse than those associated with worst-case local minima [12, 14, 34]. Moreover, it is not hard to construct examples where a worst-case initialization of gradient descent (or other first-order methods) provably converge to saddle points [30, Section 1.2.3]. The main message of our paper is that, under very mild regularity conditions, saddle points have little effect on the asymptotic behavior of first-order methods. Building on tools from the theory of dynamical systems, we generalize recent analysis of gradient descent [24, 33] to establish that a wide variety of first-order methods -- including gradient descent, proximal point algorithm, block coordinate descent, mirror descent -- avoid so-called "strict" saddle points for almost all initializations; that is, saddle points where the Hessian of the objective function admits at least one direction of negative curvature (see Definition 1). Our results provide a unified theoretical framework for analyzing the asymptotic behavior of a wide variety of classic optimization heuristics in non-convex optimization. Furthermore, we believe that furthering our understanding of the behavior and geometry of deterministic optimization techniques with random initialization can serve in the development of stochastic algorithms which improve upon their deterministic counterparts and achieve strong convergence-rate results; indeed, such insights have already led to significant improves in modifying gradient descent to navigate saddle-point geometry [15, 21]. This paper significantly extends upon the special case of gradient descent dynamics developed in the conference proceedings of the authors [24, 33].

Duplicate Docs Excel Report

Title
None found

Similar Docs  Excel Report  more

TitleSimilaritySource
None found