Goto

Collaborating Authors

 line search


6d0bf1265ea9635fb4f9d56f16d7efb2-Supplemental-Conference.pdf

Neural Information Processing Systems

Supplementary Materials for "Don't be so Monotone: Relaxing Stochastic Line Search in Over-Parameterized Models" Appendix A The Algorithm Appendix B Convergence Rates Appendix B.1 Rate of Convergence for Strongly Convex Functions Appendix B.2 Rate of Convergence for Convex Functions Appendix B.3 Rate of Convergence for Functions Satisfying the PL Condition Appendix B.4 Common Lemmas Appendix B.5 The Polyak Step Size is Bounded Appendix C Experimental details Appendix D Plots Completing the Figures in the Main Paper Appendix D.1 Comparison between PoNoS and the state-of-the-art Appendix D.2 A New Resetting Technique Appendix D.3 Time Comparison Appendix D.4 Experiments on Convex Losses Appendix D.5 Experiments on Transformers Appendix E Additional Plots Appendix E.1 Study on the Choice of c: Theory (0.5) vs Practice (0.1) Appendix E.2 Study on the Line Search Choice: V arious Nonmonotone Adaptations Appendix E.3 Zoom in on the Amount of Backtracks Appendix E.4 Study on the Choice of η In this section, we give the details of our proposed algorithm PoNoS. Training machine learning models (e.g., neural networks) entails solving the following finite sum problem: min Before that, we establish the following auxiliary result. The following Lemma shows the importance of the interpolation property. Lemma 4. W e assume interpolation and that f Let us now analyze case 2). Let us now show that b < 1. B.2 Rate of Convergence for Convex Functions In this subsection, we prove a O ( The above bound will be now proven also for case 2).


6d0bf1265ea9635fb4f9d56f16d7efb2-Paper-Conference.pdf

Neural Information Processing Systems

Recent works have shown that line search methods can speed up Stochastic Gradient Descent (SGD) and Adam in modern over-parameterized settings. However, existing line searches may take steps that are smaller than necessary since they require a monotone decrease of the (mini-)batch objective function.


Don't be so Monotone: Relaxing Stochastic Line Search in Over-Parameterized Models

Neural Information Processing Systems

Recent works have shown that line search methods can speed up Stochastic Gradient Descent (SGD) and Adam in modern over-parameterized settings. However, existing line searches may take steps that are smaller than necessary since they require a monotone decrease of the (mini-)batch objective function. We explore nonmonotone line search methods to relax this condition and possibly accept larger step sizes. Despite the lack of a monotonic decrease, we prove the same fast rates of convergence as in the monotone case. Our experiments show that nonmonotone methods improve the speed of convergence and generalization properties of SGD/Adam even beyond the previous monotone line searches. We propose a POlyak NOnmonotone Stochastic (PoNoS) method, obtained by combining a nonmonotone line search with a Polyak initial step size. Furthermore, we develop a new resetting technique that in the majority of the iterations reduces the amount of backtracks to zero while still maintaining a large initial step size. To the best of our knowledge, a first runtime comparison shows that the epoch-wise advantage of line-search-based methods gets reflected in the overall computational time.


The High Line: Exact Risk and Learning Rate Curves of Stochastic Adaptive Learning Rate Algorithms

Neural Information Processing Systems

We develop a framework for analyzing the training and learning rate dynamics on a large class of high-dimensional optimization problems, which we call the high line, trained using one-pass stochastic gradient descent (SGD) with adaptive learning rates. We give exact expressions for the risk and learning rate curves in terms of a deterministic solution to a system of ODEs. We then investigate in detail two adaptive learning rates -- an idealized exact line search and AdaGrad-Norm -- on the least squares problem. When the data covariance matrix has strictly positive eigenvalues, this idealized exact line search strategy can exhibit arbitrarily slower convergence when compared to the optimal fixed learning rate with SGD. Moreover we exactly characterize the limiting learning rate (as time goes to infinity) for line search in the setting where the data covariance has only two distinct eigenvalues. For noiseless targets, we further demonstrate that the AdaGrad-Norm learning rate converges to a deterministic constant inversely proportional to the average eigenvalue of the data covariance matrix, and identify a phase transition when the covariance density of eigenvalues follows a power law distribution.


Comparing BFGS and OGR for Second-Order Optimization

Przybysz, Adrian, Kołek, Mikołaj, Sobota, Franciszek, Duda, Jarek

arXiv.org Artificial Intelligence

Across standard test functions and ablations with/without line search, OGR variants match or outperform BFGS in final objective and step efficiency, with particular gains in nonconvex landscapes where saddle handling matters. Exact Hessians (via AD) are used only as an oracle baseline to evaluate estimation quality, not to form steps. II. Online Gradient Regression (OGR) Online Gradient Regression (OGR) is a second-order optimization framework that accelerates stochastic gradient descent (SGD) by online least-squares regression of noisy gradients to infer local curvature and the distance to a stationary point [3]. The central assumption is that, in a small neighborhood, the objective F (θ) is well-approximated by a quadratic model, so the gradient varies approximately linearly with the parameters. OGR maintains exponentially weighted statistics of recent (θ t, g t) pairs and updates a local model each iteration at negligible extra cost compared to computing the gradient itself [2], [3]. A. Direct multivariate approach In given time T, based on recent gradients g t R d and positions θ t R d for t < T, we would like to locally approximate behavior with 2nd order polynomial using parametrization: f (θ) = h + 1 2 (θ p) T H(θ p) f = H(θ p) for Hessian H R d d and p R d position of saddle or extremum. For local behavior we will work on averages with weights w t further decreasing exponentially, defining averages: v null t






high_prob_ls_nonconvex_final

Billy Jin

Neural Information Processing Systems

We consider a line-search method for continuous optimization under a stochastic setting where the function values and gradients are available only through inexact probabilistic zeroth and first-order oracles. These oracles capture multiple standard settings including expected loss minimization and zeroth-order optimization. Moreover, our framework is very general and allows the function and gradient estimates to be biased. The proposed algorithm is simple to describe, easy to implement, and uses these oracles in a similar way as the standard deterministic line search uses exact function and gradient values. Under fairly general conditions on the oracles, we derive a high probability tail bound on the iteration complexity of the algorithm when applied to non-convex smooth functions. These results are stronger than those for other existing stochastic line search methods and apply in more general settings.