Accelerated gradient method


Improving Neural Ordinary Differential Equations with Nesterov's Accelerated Gradient Method

Neural Information Processing Systems

We propose Nesterov neural ordinary differential equations (NesterovNODEs), whose layers solve the second-order ordinary differential equation (ODE) limit of Nesterov's accelerated gradient (NAG) method, together with a generalization called GNesterovNODEs. Taking advantage of the $\mathcal{O}(1/k^{2})$ convergence rate of the NAG scheme, GNesterovNODEs speed up training and inference by reducing the number of function evaluations (NFEs) needed to solve the ODEs. We also prove that the adjoint state of a GNesterovNODE itself satisfies a GNesterovNODE, thus accelerating both the forward and backward ODE solvers and allowing the model to be scaled up for large-scale tasks. We empirically corroborate the advantage of GNesterovNODEs on a wide range of practical applications, including point cloud separation, image classification, and sequence modeling. Compared to NODEs, GNesterovNODEs require significantly fewer NFEs while achieving better accuracy across our experiments.
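For intuition, the continuous-time limit of NAG (due to Su, Boyd, and Candès) is the second-order ODE $\ddot{x}(t) + \frac{3}{t}\dot{x}(t) + \nabla f(x(t)) = 0$; a NesterovNODE-style layer replaces $\nabla f$ with a learned vector field. The sketch below only illustrates these dynamics and is not the authors' implementation: the toy vector field f_theta, the damping coefficient, and the fixed-step RK4 integrator are all assumptions.

```python
import numpy as np

def f_theta(x, t):
    """Stand-in for a learned vector field; a real model would use a neural net."""
    return np.tanh(x) + 0.1 * t

def nesterov_node_flow(x0, t0=0.1, t1=1.0, steps=100):
    """Integrate the NAG-limit ODE  x'' + (3/t) x' + f_theta(x, t) = 0,
    written as the first-order system  x' = v,  v' = -(3/t) v - f_theta(x, t),
    with a fixed-step RK4 solver (adaptive solvers are used in practice)."""
    h = (t1 - t0) / steps
    x, v, t = x0.copy(), np.zeros_like(x0), t0

    def rhs(t, x, v):
        return v, -(3.0 / t) * v - f_theta(x, t)

    for _ in range(steps):
        k1x, k1v = rhs(t, x, v)
        k2x, k2v = rhs(t + h / 2, x + h / 2 * k1x, v + h / 2 * k1v)
        k3x, k3v = rhs(t + h / 2, x + h / 2 * k2x, v + h / 2 * k2v)
        k4x, k4v = rhs(t + h, x + h * k3x, v + h * k3v)
        x = x + h / 6 * (k1x + 2 * k2x + 2 * k3x + k4x)
        v = v + h / 6 * (k1v + 2 * k2v + 2 * k3v + k4v)
        t += h
    return x  # the layer's output state

print(nesterov_node_flow(np.array([1.0, -0.5])))
```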


An Accelerated Gradient Method for Convex Smooth Simple Bilevel Optimization

Neural Information Processing Systems

In this paper, we focus on simple bilevel optimization problems, where we minimize a convex smooth objective function over the optimal solution set of another convex smooth constrained optimization problem. We present a novel bilevel optimization method that locally approximates the solution set of the lower-level problem using a cutting-plane approach and employs an accelerated gradient-based update to reduce the upper-level objective function over the approximated solution set. We measure the performance of our method in terms of suboptimality and infeasibility errors and provide non-asymptotic convergence guarantees for both error criteria. Specifically, when the feasible set is compact, we show that our method requires at most $\mathcal{O}(\max\{1/\sqrt{\epsilon_{f}},\, 1/\epsilon_{g}\})$ iterations to find a solution that is $\epsilon_{f}$-suboptimal and $\epsilon_{g}$-infeasible. Moreover, under the additional assumption that the lower-level objective satisfies the $r$-th Hölderian error bound, we show that our method achieves an iteration complexity of $\mathcal{O}(\max\{\epsilon_{f}^{-\frac{2r-1}{2r}},\, \epsilon_{g}^{-\frac{2r-1}{2r}}\})$, which matches the optimal complexity of single-level convex constrained optimization when $r = 1$.
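To make the general idea concrete, the sketch below shows one plausible iteration of a scheme of this kind: a halfspace cut built from the lower-level gradient approximates the lower-level solution set, and a Nesterov-style step on the upper-level objective is projected onto that halfspace. The functions grad_f, grad_g, the cut construction, the momentum schedule, and the toy problem are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def project_halfspace(x, a, b):
    """Euclidean projection of x onto the halfspace {z : a^T z <= b}."""
    viol = a @ x - b
    if viol <= 0:
        return x
    return x - (viol / (a @ a)) * a

def bilevel_accelerated_step(x, x_prev, k, grad_f, g, grad_g, g_target, step):
    """One illustrative iteration: accelerated step on the upper-level objective f,
    projected onto a cutting-plane approximation of {z : g(z) <= g_target}."""
    y = x + (k - 1) / (k + 2) * (x - x_prev)   # Nesterov-style extrapolation
    z = y - step * grad_f(y)                   # gradient step on the upper level
    # Linearize the lower level at y: g(y) + grad_g(y)^T (z - y) <= g_target.
    a = grad_g(y)
    b = g_target - g(y) + a @ y
    return project_halfspace(z, a, b), x

# Toy usage: f(x) = ||x - c||^2 over the minimizers of g(x) = ||A x - d||^2.
rng = np.random.default_rng(0)
A, d, c = rng.normal(size=(3, 5)), rng.normal(size=3), rng.normal(size=5)
f_grad = lambda x: 2 * (x - c)
g = lambda x: np.sum((A @ x - d) ** 2)
g_grad = lambda x: 2 * A.T @ (A @ x - d)
g_star = g(np.linalg.lstsq(A, d, rcond=None)[0])  # optimal lower-level value

x, x_prev = np.zeros(5), np.zeros(5)
for k in range(1, 200):
    x, x_prev = bilevel_accelerated_step(x, x_prev, k, f_grad, g, g_grad,
                                         g_target=g_star + 1e-6, step=0.05)
print("upper-level value:", np.sum((x - c) ** 2), " lower-level value:", g(x))
```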


A Concise Lyapunov Analysis of Nesterov's Accelerated Gradient Method

Liu, Jun

arXiv.org Artificial Intelligence

Nesterov's accelerated gradient method [7,8] has gained significant attention due to its provable acceleration on general convex functions beyond quadratics. A special focus has been on using dynamical-systems tools [12,10,3,14] and control-theoretic methods [5,9] for the analysis and design of such algorithms. In the standard textbook [8] by Nesterov, the convergence analysis of accelerated gradient methods is conducted using a technique known as estimating sequences. These are essentially auxiliary comparison functions used to prove the convergence rates of optimization algorithms. As pointed out in [14], estimating sequences are usually constructed inductively and can be difficult to understand and apply. This motivated the Lyapunov analysis in [14], which aims to unify the analysis of a broad class of accelerated algorithms. Despite this comprehensive work, to the best of the author's knowledge, a simple and direct Lyapunov analysis of the original scheme of Nesterov's accelerated gradient method is still lacking.
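For reference, the scheme in question, for an $L$-smooth convex function $f$, with $y_{0} = x_{0}$ and in one common indexing, reads
$$x_{k+1} = y_{k} - \tfrac{1}{L}\nabla f(y_{k}), \qquad y_{k+1} = x_{k+1} + \tfrac{k}{k+3}\,(x_{k+1} - x_{k}),$$
and attains the accelerated rate $f(x_{k}) - f^{*} = \mathcal{O}\!\left(L\|x_{0} - x^{*}\|^{2}/k^{2}\right)$, versus $\mathcal{O}(1/k)$ for plain gradient descent; the specific Lyapunov function constructed in the paper is not reproduced here.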


Accelerated Gradient Methods for Stochastic Optimization and Online Learning

Hu, Chonghai, Pan, Weike, Kwok, James T.

Neural Information Processing Systems

Regularized risk minimization often involves non-smooth optimization, either because of the loss function (e.g., hinge loss) or the regularizer (e.g., $\ell_1$-regularizer). Gradient descent methods, though highly scalable and easy to implement, are known to converge slowly on these problems. In this paper, we develop novel accelerated gradient methods for stochastic optimization while still preserving their computational simplicity and scalability. The proposed algorithm, called SAGE (Stochastic Accelerated GradiEnt), exhibits fast convergence rates on stochastic optimization with both convex and strongly convex objectives. Experimental results show that SAGE is faster than recent (sub)gradient methods including FOLOS, SMIDAS and SCD.
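As a rough illustration of the accelerated stochastic setting described above, the sketch below combines a Nesterov-style extrapolation with a mini-batch stochastic gradient and a soft-thresholding (proximal) step for an $\ell_1$ regularizer. It is not the SAGE update itself; the least-squares loss, step size, and momentum schedule are assumptions.

```python
import numpy as np

def soft_threshold(z, tau):
    """Proximal operator of tau * ||.||_1 (soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def accelerated_prox_sgd(X, y, lam=0.1, epochs=30, step=0.02, batch=32, seed=0):
    """Illustrative accelerated proximal stochastic gradient for
    min_w (1/2n)||Xw - y||^2 + lam*||w||_1  (not the exact SAGE scheme)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, w_prev, k = np.zeros(d), np.zeros(d), 1
    for _ in range(epochs):
        for _ in range(n // batch):
            idx = rng.integers(0, n, size=batch)
            v = w + (k - 1) / (k + 2) * (w - w_prev)         # Nesterov extrapolation
            grad = X[idx].T @ (X[idx] @ v - y[idx]) / batch  # stochastic gradient at v
            w_prev, w = w, soft_threshold(v - step * grad, step * lam)
            k += 1
    return w

# Toy usage on synthetic sparse regression data.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 20))
w_true = np.zeros(20); w_true[:3] = [2.0, -1.0, 0.5]
y = X @ w_true + 0.01 * rng.normal(size=500)
print(np.round(accelerated_prox_sgd(X, y), 2))
```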


Better Mini-Batch Algorithms via Accelerated Gradient Methods

Cotter, Andrew, Shamir, Ohad, Srebro, Nati, Sridharan, Karthik

Neural Information Processing Systems

Mini-batch algorithms have recently received significant attention as a way to speed up stochastic convex optimization problems. In this paper, we study how such algorithms can be improved using accelerated gradient methods. We provide a novel analysis, which shows how standard gradient methods may sometimes be insufficient to obtain a significant speed-up. We propose a novel accelerated gradient algorithm, which deals with this deficiency, and enjoys a uniformly superior guarantee. We conclude our paper with experiments on real-world datasets, which validate our algorithm and substantiate our theoretical insights.
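The contrast the abstract draws can be pictured as follows: with a mini-batch of size $b$, a plain SGD step averages $b$ stochastic gradients, while an accelerated variant layers Nesterov-style momentum on top of the averaged gradient. The sketch below is a generic accelerated mini-batch update, not the specific algorithm or step-size policy proposed in the paper.

```python
import numpy as np

def accelerated_minibatch_sgd(grad_fn, data, w0, step=0.1, batch=64, iters=500, seed=0):
    """Generic Nesterov-accelerated mini-batch SGD (illustrative only).
    grad_fn(w, samples) must return the average gradient over the mini-batch."""
    rng = np.random.default_rng(seed)
    w, w_prev = w0.copy(), w0.copy()
    for k in range(1, iters + 1):
        samples = data[rng.integers(0, len(data), size=batch)]
        y = w + (k - 1) / (k + 2) * (w - w_prev)   # look-ahead (momentum) point
        g = grad_fn(y, samples)                    # averaged mini-batch gradient at y
        w_prev, w = w, y - step * g
    return w

# Toy usage: least-squares on synthetic data, with (features, target) stacked per row.
rng = np.random.default_rng(2)
A = rng.normal(size=(1000, 10))
b = A @ np.arange(1.0, 11.0) + 0.1 * rng.normal(size=1000)
data = np.hstack([A, b[:, None]])
grad_fn = lambda w, s: s[:, :-1].T @ (s[:, :-1] @ w - s[:, -1]) / len(s)
print(np.round(accelerated_minibatch_sgd(grad_fn, data, np.zeros(10)), 2))
```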


NEON+: Accelerated Gradient Methods for Extracting Negative Curvature for Non-Convex Optimization

Xu, Yi, Jin, Rong, Yang, Tianbao

arXiv.org Machine Learning

Accelerated gradient (AG) methods are breakthroughs in convex optimization, improving the convergence rate of the gradient descent method for optimization with smooth functions. However, the analysis of AG methods for non-convex optimization is still limited. It remains an open question whether AG methods from convex optimization can accelerate the convergence of the gradient descent method for finding local minima of non-convex optimization problems. This paper provides an affirmative answer to this question. In particular, we analyze two renowned variants of AG methods (namely Polyak's Heavy Ball method and Nesterov's Accelerated Gradient method) for extracting the negative curvature from random noise, which is central to escaping from saddle points. By leveraging the proposed AG methods for extracting the negative curvature, we present a new AG algorithm with double loops for non-convex optimization (in contrast to a single-loop AG algorithm proposed in a recent manuscript [AGNON], which directly analyzed Nesterov's AG method for non-convex optimization and appeared online on November 29, 2017; we emphasize that our work is independent, inspired by our earlier work [NEON17], and based on a different novel analysis). The proposed algorithm converges to a second-order stationary point $x$ such that $\|\nabla f(x)\|\leq \epsilon$ and $\nabla^2 f(x)\geq -\sqrt{\epsilon}\, I$ with $\widetilde{O}(1/\epsilon^{1.75})$ iteration complexity, improving that of the gradient descent method by a factor of $\epsilon^{-0.25}$ and matching the best iteration complexity of second-order Hessian-free methods for non-convex optimization.
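A rough sketch of the negative-curvature-extraction idea: starting from small random noise, run a momentum (heavy-ball-style) iteration on the local quadratic $u \mapsto \frac{1}{2} u^{\top} \nabla^2 f(x)\, u$ using only Hessian-vector products; if the Hessian has a sufficiently negative eigenvalue, the iterate blows up along the corresponding direction. The finite-difference Hessian-vector product, the momentum constant, and the escape threshold below are illustrative assumptions, not the constants or guarantees from the paper.

```python
import numpy as np

def hvp(grad_fn, x, v, eps=1e-5):
    """Finite-difference Hessian-vector product: H v ~ (grad(x+eps*v) - grad(x)) / eps."""
    return (grad_fn(x + eps * v) - grad_fn(x)) / eps

def extract_negative_curvature(grad_fn, x, dim, steps=200, eta=0.05, beta=0.9,
                               radius=1.0, seed=0):
    """Heavy-ball iteration on u -> 0.5 * u^T H u starting from small noise.
    Returns a unit direction of (likely) negative curvature, or None."""
    rng = np.random.default_rng(seed)
    u = 1e-3 * rng.normal(size=dim)
    u_prev = u.copy()
    for _ in range(steps):
        u_new = u - eta * hvp(grad_fn, x, u) + beta * (u - u_prev)
        u_prev, u = u, u_new
        if np.linalg.norm(u) > radius:   # blow-up signals a negative-curvature direction
            return u / np.linalg.norm(u)
    return None

# Toy usage at the saddle point of f(x) = x0^2 - x1^2 (gradient known in closed form).
grad_fn = lambda x: np.array([2 * x[0], -2 * x[1]])
d = extract_negative_curvature(grad_fn, x=np.zeros(2), dim=2)
print("negative-curvature direction:", d)  # should align with the x1 axis
```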