Escaping Saddle Points


SSRGD: Simple Stochastic Recursive Gradient Descent for Escaping Saddle Points

Neural Information Processing Systems

We analyze stochastic gradient algorithms for optimizing nonconvex problems. In particular, our goal is to find local minima (second-order stationary points) rather than just first-order stationary points, which may be undesirable unstable saddle points.
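Below is a minimal numpy sketch of the kind of update this combines: a SARAH/SPIDER-style recursive gradient estimator plus a small random perturbation whenever the estimated gradient is small. The callables `full_grad` and `stoch_grad_diff`, the step size, the perturbation radius, and the loop lengths are illustrative placeholders, not the paper's exact algorithm or tuned parameters.

```python
import numpy as np

def ssrgd_sketch(full_grad, stoch_grad_diff, x0, epochs=10, inner_len=100,
                 eta=0.01, grad_tol=1e-3, perturb_radius=1e-3, rng=None):
    """Perturbed recursive-gradient loop (illustrative only).

    full_grad(x)              -> full gradient at x
    stoch_grad_diff(x_new, x) -> grad_i(x_new) - grad_i(x) on one sampled minibatch i,
                                 using the same minibatch at both points
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(epochs):
        v = full_grad(x)                                 # snapshot (full) gradient
        if np.linalg.norm(v) <= grad_tol:                # near-stationary: perturb to
            x = x + perturb_radius * rng.standard_normal(x.shape)  # move off a possible saddle
            v = full_grad(x)
        for _ in range(inner_len):
            x_new = x - eta * v                          # descend along the estimator
            v = stoch_grad_diff(x_new, x) + v            # recursive (SARAH/SPIDER) update
            x = x_new
    return x
```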


Zeroth-Order Negative Curvature Finding: Escaping Saddle Points without Gradients

Neural Information Processing Systems

We consider escaping saddle points of nonconvex problems where only function evaluations can be accessed. Although a variety of works have been proposed, the majority of them require either second- or first-order information, and only a few exploit zeroth-order methods; in particular, negative curvature finding, which has been proven to be the most efficient technique for escaping saddle points, has scarcely been explored with zeroth-order methods. To fill this gap, in this paper we propose two zeroth-order negative curvature finding frameworks that can replace Hessian-vector product computations without increasing the iteration complexity. We apply the proposed frameworks to ZO-GD, ZO-SGD, ZO-SCSG, and ZO-SPIDER and prove that these ZO algorithms converge to $(\epsilon,\delta)$-approximate second-order stationary points with lower query complexity than prior zeroth-order works for finding local minima.
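As a rough illustration of how Hessian-vector products can be replaced by function evaluations, the sketch below estimates $Hv$ with a mixed finite difference of $f$ and runs a shifted power iteration to search for a negative-curvature direction. The smoothing radius `h`, the shift `L` (an assumed smoothness bound), and the iteration count are assumptions for illustration; this is not one of the frameworks analyzed in the paper.

```python
import numpy as np

def zo_hvp(f, x, v, h=1e-4):
    """Approximate (Hv)_j via mixed finite differences of f along e_j and v."""
    d = x.size
    hv = np.empty(d)
    f_xv, f_x = f(x + h * v), f(x)
    for j in range(d):
        e = np.zeros(d)
        e[j] = h
        hv[j] = (f(x + e + h * v) - f(x + e) - f_xv + f_x) / h ** 2
    return hv

def zo_negative_curvature(f, x, L=1.0, iters=50, h=1e-4, rng=None):
    """Power iteration on (L*I - H) using only function values; returns a unit
    vector that approximately minimizes the Rayleigh quotient v^T H v."""
    rng = np.random.default_rng() if rng is None else rng
    v = rng.standard_normal(x.size)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = L * v - zo_hvp(f, x, v, h)   # top eigenvector of L*I - H corresponds
        v = w / np.linalg.norm(w)        # to the smallest-eigenvalue direction of H
    return v
```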


Escaping Saddle Points for Effective Generalization on Class-Imbalanced Data

Neural Information Processing Systems

Several techniques based on re-weighting and margin adjustment of the loss are often used to enhance the performance of neural networks, particularly on minority classes. In this work, we analyze the class-imbalanced learning problem by examining the loss landscape of neural networks trained with re-weighting and margin-based techniques. Specifically, we examine the spectral density of the Hessian of the class-wise loss, through which we observe that the network weights converge to a saddle point in the loss landscapes of minority classes. Following this observation, we also find that optimization methods designed to escape saddle points can be effectively used to improve generalization on minority classes. We further demonstrate, theoretically and empirically, that Sharpness-Aware Minimization (SAM), a recent technique that encourages convergence to a flat minimum, can be effectively used to escape saddle points for minority classes.
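For reference, a single SAM step with plain gradient descent as the base optimizer looks roughly as follows; `rho` and `lr` are illustrative values, and the per-class application studied in the paper is not shown.

```python
import numpy as np

def sam_step(grad, w, lr=0.1, rho=0.05):
    """One Sharpness-Aware Minimization step (sketch)."""
    g = grad(w)
    w_adv = w + rho * g / (np.linalg.norm(g) + 1e-12)  # ascend to a worst-case nearby point
    return w - lr * grad(w_adv)                        # descend using the gradient taken there
```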


Escaping Saddle Points with Compressed SGD

Neural Information Processing Systems

Stochastic gradient descent (SGD) is a prevalent optimization technique for large-scale distributed machine learning. While SGD computation can be efficiently divided between multiple machines, communication typically becomes a bottleneck in the distributed setting. Gradient compression methods can be used to alleviate this problem, and a recent line of work shows that SGD augmented with gradient compression converges to an $\varepsilon$-first-order stationary point. In this paper we extend these results to convergence to an $\varepsilon$-second-order stationary point ($\varepsilon$-SOSP), which is, to the best of our knowledge, the first result of this type. In addition, we show that, when the stochastic gradient is not Lipschitz, compressed SGD with the RandomK compressor converges to an $\varepsilon$-SOSP with the same number of iterations as uncompressed SGD [Jin et al., 2021] (JACM), while improving the total communication by a factor of $\tilde \Theta(\sqrt{d} \varepsilon^{-3/4})$, where $d$ is the dimension of the optimization problem. We present additional results for the cases when the compressor is arbitrary and when the stochastic gradient is Lipschitz.
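A minimal single-machine sketch of the RandomK compressor (keep $k$ randomly chosen coordinates, rescaled by $d/k$ so the compressor is unbiased) applied to one stochastic gradient step is given below; in the distributed setting of the paper the compression is applied to gradients communicated between machines, which this toy version omits, and `k` and `lr` are placeholders.

```python
import numpy as np

def random_k(g, k, rng):
    """RandomK compressor: keep k coordinates, rescale by d/k for unbiasedness."""
    idx = rng.choice(g.size, size=k, replace=False)
    out = np.zeros_like(g)
    out[idx] = g[idx] * (g.size / k)   # E[out] = g
    return out

def compressed_sgd_step(stoch_grad, x, lr=0.01, k=10, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    return x - lr * random_k(stoch_grad(x), k, rng)
```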


Escaping Saddle Points in Constrained Optimization

Neural Information Processing Systems

In this paper, we study the problem of escaping from saddle points in smooth nonconvex optimization problems subject to a convex set $\mathcal{C}$. We propose a generic framework that yields convergence to a second-order stationary point of the problem, provided the convex set $\mathcal{C}$ is simple for a quadratic objective function, i.e., a quadratic program over $\mathcal{C}$ can be solved approximately in polynomial time.


Dimer-Enhanced Optimization: A First-Order Approach to Escaping Saddle Points in Neural Network Training

Hu, Yue, Cao, Zanxia, Liu, Yingchao

arXiv.org Machine Learning

First-order optimization methods, such as SGD and Adam, are widely used for training large-scale deep neural networks due to their computational efficiency and robust performance. However, relying solely on gradient information, these methods often struggle to navigate complex loss landscapes with flat regions, plateaus, and saddle points. Second-order methods, which use curvature information from the Hessian matrix, can address these challenges but are computationally infeasible for large models. The Dimer method, a first-order technique that constructs two closely spaced points to probe the local geometry of a potential energy surface, efficiently estimates curvature using only gradient information. Inspired by its use in molecular dynamics simulations for locating saddle points, we propose Dimer-Enhanced Optimization (DEO), a novel framework to escape saddle points in neural network training. DEO adapts the Dimer method to explore a broader region of the loss landscape, approximating the Hessian's smallest eigenvector without computing the full matrix. By periodically projecting the gradient onto the subspace orthogonal to the minimum curvature direction, DEO guides the optimizer away from saddle points and flat regions, enhancing training efficiency with non-stepwise updates. Preliminary experiments on a Transformer toy model show DEO achieves competitive performance compared to standard first-order methods, improving navigation of complex loss landscapes. Our work repurposes physics-inspired, first-order curvature estimation to enhance neural network training in high-dimensional spaces.
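A rough sketch of the dimer idea as described above: the minimum-curvature direction is estimated from gradients at the two dimer endpoints $x \pm hv$, and near a saddle the gradient is projected onto the subspace orthogonal to that direction. The separation `h`, the rotation loop, and the projection rule are illustrative simplifications, not DEO's actual update schedule.

```python
import numpy as np

def dimer_min_curvature_direction(grad, x, v0, h=1e-3, steps=20, lr=0.1):
    """Estimate the smallest-curvature direction using only endpoint gradients."""
    v = v0 / np.linalg.norm(v0)
    for _ in range(steps):
        hv = (grad(x + h * v) - grad(x - h * v)) / (2 * h)  # finite-difference H v
        f_rot = hv - (hv @ v) * v   # rotational force: component of Hv orthogonal to v
        v = v - lr * f_rot          # rotate the dimer to reduce v^T H v
        v /= np.linalg.norm(v)
    curvature = ((grad(x + h * v) - grad(x - h * v)) @ v) / (2 * h)
    return v, curvature

def deo_gradient(grad, x, v0, h=1e-3):
    """Project the gradient off the minimum-curvature direction when that curvature is negative."""
    g = grad(x)
    v, curv = dimer_min_curvature_direction(grad, x, v0, h)
    return g - (g @ v) * v if curv < 0 else g
```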


Reviews: SSRGD: Simple Stochastic Recursive Gradient Descent for Escaping Saddle Points

Neural Information Processing Systems

After response: I maintain my evaluation of the technical innovation and suboptimality of this paper. The basic Spider and SpiderBoost algorithms are both for first-order stationary points; they are almost the same, and both give an $n^{1/2}$ rate. The simple way to modify both algorithms to escape saddle points is to add a Negative Curvature Search (NCS) subroutine (which can be done in a very modular way, and is already shown in the Spider paper). I'd say it is almost trivial to also show that SpiderBoost + NCS finds a second-order stationary point at an $n^{1/2}$ rate. Comparing this paper with SpiderBoost + NCS, there is no improvement from $n^{2/3}$ to $n^{1/2}$ (since SpiderBoost is already $n^{1/2}$) and no simplification of Spider (as SpiderBoost already did so). The only difference is replacing NCS with perturbations, which again requires some work, but most of the techniques are already there.


Reviews: SSRGD: Simple Stochastic Recursive Gradient Descent for Escaping Saddle Points

Neural Information Processing Systems

After extensive back-and-forth discussion by the reviewers, ultimately we felt that, despite the trivial solutions that are possible by mixing Spider with negative curvature search, the approach of the present paper has its usefulness (as noted in the reviews too), and that the paper can be accepted. The authors are, however, strongly encouraged to look into the reviews carefully and implement the points mentioned in the rebuttal -- it would be a pity if, after all the back-and-forth that this paper's review cycle has witnessed, the authors did not update the paper to clarify its contribution, its value, and its contrasts with less implementable methods.

