line search
The High Line: Exact Risk and Learning Rate Curves of Stochastic Adaptive Learning Rate Algorithms
We develop a framework for analyzing the training and learning rate dynamics on a large class of high-dimensional optimization problems, which we call the high line, trained using one-pass stochastic gradient descent (SGD) with adaptive learning rates. We give exact expressions for the risk and learning rate curves in terms of a deterministic solution to a system of ODEs. We then investigate in detail two adaptive learning rates -- an idealized exact line search and AdaGrad-Norm -- on the least squares problem. When the data covariance matrix has strictly positive eigenvalues, this idealized exact line search strategy can exhibit arbitrarily slower convergence when compared to the optimal fixed learning rate with SGD. Moreover, we exactly characterize the limiting learning rate (as time goes to infinity) for line search in the setting where the data covariance has only two distinct eigenvalues. For noiseless targets, we further demonstrate that the AdaGrad-Norm learning rate converges to a deterministic constant inversely proportional to the average eigenvalue of the data covariance matrix, and identify a phase transition when the covariance density of eigenvalues follows a power law distribution.
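As a rough illustration of the setting described in this abstract, the sketch below runs one-pass SGD with an AdaGrad-Norm learning rate on a noiseless least squares problem and records the risk and learning rate at every step. It uses the textbook AdaGrad-Norm update (accumulate squared gradient norms, step with eta divided by the square root of the accumulator) and does not reproduce the paper's high-dimensional scaling or its ODE system. The function name, the diagonal Gaussian data model, and all parameter names are assumptions made for this example.

```python
import random

def adagrad_norm_least_squares(eigs, x_star, steps, eta=1.0, b0=1e-8, seed=0):
    """One-pass SGD with a textbook AdaGrad-Norm step size (illustrative only).

    Assumptions: data a ~ N(0, diag(eigs)), noiseless targets y = <a, x_star>,
    and risk R(x) = 0.5 * sum_i eigs[i] * (x[i] - x_star[i])**2.
    """
    rng = random.Random(seed)
    d = len(eigs)
    x = [0.0] * d
    b_sq = b0 ** 2              # running sum of squared gradient norms
    risks, lrs = [], []
    for _ in range(steps):
        # Draw a fresh sample; one-pass means no sample is ever reused.
        a = [rng.gauss(0.0, e ** 0.5) for e in eigs]
        residual = sum(ai * (xi - si) for ai, xi, si in zip(a, x, x_star))
        grad = [residual * ai for ai in a]
        # AdaGrad-Norm: accumulate the squared gradient norm, then step.
        b_sq += sum(g * g for g in grad)
        lr = eta / b_sq ** 0.5
        x = [xi - lr * g for xi, g in zip(x, grad)]
        risk = 0.5 * sum(e * (xi - si) ** 2 for e, xi, si in zip(eigs, x, x_star))
        risks.append(risk)
        lrs.append(lr)
    return x, risks, lrs
```

Because the accumulator never decreases, the recorded learning rates are non-increasing by construction. Plotting `lrs` against the step index is one way to eyeball whether the rate settles near a constant for a given spectrum `eigs`.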
6d0bf1265ea9635fb4f9d56f16d7efb2-Supplemental-Conference.pdf
Supplementary Materials for "Don't be so Monotone: Relaxing Stochastic Line Search in Over-Parameterized Models"
Appendix A The Algorithm
Appendix B Convergence Rates
Appendix B.1 Rate of Convergence for Strongly Convex Functions
Appendix B.2 Rate of Convergence for Convex Functions
Appendix B.3 Rate of Convergence for Functions Satisfying the PL Condition
Appendix B.4 Common Lemmas
Appendix B.5 The Polyak Step Size is Bounded
Appendix C Experimental details
Appendix D Plots Completing the Figures in the Main Paper
Appendix D.1 Comparison between PoNoS and the state-of-the-art
Appendix D.2 A New Resetting Technique
Appendix D.3 Time Comparison
Appendix D.4 Experiments on Convex Losses
Appendix D.5 Experiments on Transformers
Appendix E Additional Plots
Appendix E.1 Study on the Choice of c: Theory (0.5) vs Practice (0.1)
Appendix E.2 Study on the Line Search Choice: Various Nonmonotone Adaptations
Appendix E.3 Zoom in on the Amount of Backtracks
Appendix E.4 Study on the Choice of η
In this section, we give the details of our proposed algorithm PoNoS. Training machine learning models (e.g., neural networks) entails solving the following finite sum problem: min
Before that, we establish the following auxiliary result. The following Lemma shows the importance of the interpolation property.
Lemma 4. We assume interpolation and that f
Let us now analyze case 2). Let us now show that b < 1.
B.2 Rate of Convergence for Convex Functions
In this subsection, we prove a O (
The above bound will be now proven also for case 2).
6d0bf1265ea9635fb4f9d56f16d7efb2-Paper-Conference.pdf
Recent works have shown that line search methods can speed up Stochastic Gradient Descent (SGD) and Adam in modern over-parameterized settings. However, existing line searches may take steps that are smaller than necessary since they require a monotone decrease of the (mini-)batch objective function.
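To make the contrast between monotone and nonmonotone acceptance concrete, here is a minimal sketch of a backtracking Armijo line search whose reference value is the maximum of the current mini-batch loss and a short history of earlier losses. This is a generic nonmonotone relaxation written for illustration, not the PoNoS algorithm from the paper: the initial step rule, the exact nonmonotone term, and any resetting technique used there are not reproduced. The function name and every parameter name are hypothetical.

```python
def nonmonotone_armijo_step(f, grad, x, history, step0=1.0, c=0.1,
                            shrink=0.5, max_backtracks=30):
    """One backtracking step with a relaxed (nonmonotone) Armijo condition.

    f and grad are assumed to evaluate the SAME mini-batch objective.
    history holds previously observed losses; an empty list recovers the
    usual monotone Armijo condition.
    """
    g = grad(x)
    g_sq = sum(gi * gi for gi in g)
    # Reference value: the largest of the recent losses and the current one.
    ref = max(list(history) + [f(x)])
    step = step0
    x_new = x
    for _ in range(max_backtracks):
        x_new = [xi - step * gi for xi, gi in zip(x, g)]
        # Accept if the trial loss is sufficiently below the reference value.
        if f(x_new) <= ref - c * step * g_sq:
            break
        step *= shrink
    # If no trial was accepted, the last (smallest) trial point is returned.
    return x_new, step


# Toy quadratic f(x) = 0.5 * ||x||^2, standing in for a mini-batch loss.
f = lambda x: 0.5 * sum(xi * xi for xi in x)
grad = lambda x: list(x)

_, step_monotone = nonmonotone_armijo_step(f, grad, [1.0, 1.0], [], step0=4.0)
_, step_relaxed = nonmonotone_armijo_step(f, grad, [1.0, 1.0], [10.0], step0=4.0)
```

On this toy problem the monotone search backtracks from 4.0 down to 1.0, while the relaxed search, given a history containing a larger earlier loss, accepts the initial step of 4.0 without backtracking. That is the behaviour the abstract alludes to: requiring a decrease relative to the current loss alone can force smaller steps than necessary.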
e8f2779682fd11fa2067beffc27a9192-Supplemental.pdf
In this analysis, we assume that evaluating the GP prior mean and kernel functions (and the corresponding derivatives) takes O(1) time. For each fantasy model, we need to compute the posterior mean and covariance matrix for the L points (x, w_{1:L}), on which we draw the sample paths. This results in a total cost of O(KML^2) to generate all samples. The SAA approach trades a stochastic optimization problem for a deterministic approximation, which can be efficiently optimized. Suppose that we are interested in the optimization problem min_x E_ω[h(x, ω)].
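The SAA idea mentioned above can be sketched in a few lines: draw a fixed set of samples ω_1, ..., ω_N once, replace the expectation by their average, and minimize the resulting deterministic function. The example below is a one-dimensional toy with plain gradient descent, not the GP-based procedure from the paper. The function name, the sampler interface, and the choice of optimizer are assumptions made for illustration.

```python
import random

def saa_minimize(h_grad, sample_omega, n_samples=256, x0=0.0,
                 lr=0.25, iters=60, seed=0):
    """Minimize the sample average (1/N) * sum_i h(x, omega_i) over scalar x.

    h_grad(x, omega) returns dh/dx; sample_omega(rng) draws one omega.
    The samples are drawn once and then frozen, so every gradient step
    works on the same deterministic objective.
    """
    rng = random.Random(seed)
    omegas = [sample_omega(rng) for _ in range(n_samples)]
    x = x0
    for _ in range(iters):
        # Exact gradient of the frozen sample-average objective.
        g = sum(h_grad(x, w) for w in omegas) / n_samples
        x -= lr * g
    return x, omegas


# Toy problem: h(x, omega) = (x - omega)^2 with omega ~ N(3, 1).
x_hat, omegas = saa_minimize(
    h_grad=lambda x, w: 2.0 * (x - w),
    sample_omega=lambda rng: rng.gauss(3.0, 1.0),
)
```

For this quadratic h, the minimizer of the sample-average objective is the sample mean of the drawn ω values, so `x_hat` should agree with `sum(omegas) / len(omegas)` up to floating-point error. Freezing the samples makes the objective deterministic, which is what allows a standard deterministic optimizer to be applied.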