Gradient Descent
Necessary and Sufficient Conditions for Adaptive, Mirror, and Standard Gradient Methods
We study the impact of the constraint set and gradient geometry on the convergence of online and stochastic methods for convex optimization, providing a characterization of the geometries for which stochastic gradient and adaptive gradient methods are (minimax) optimal. In particular, we show that when the constraint set is quadratically convex, diagonally pre-conditioned stochastic gradient methods are minimax optimal. We further provide a converse that shows that when the constraints are not quadratically convex---for example, any $\ell_p$-ball for $p < 2$---the methods are far from optimal. Based on this, we can provide concrete recommendations for when one should use adaptive, mirror or stochastic gradient methods.
Decentralized Markov Chain Gradient Descent
Sun, Tao, Chen, Tianyi, Sun, Yuejiao, Liao, Qing, Li, Dongsheng
Decentralized stochastic gradient method emerges as a promising solution for solving large-scale machine learning problems. This paper studies the decentralized Markov chain gradient descent (DMGD) algorithm - a variant of the decentralized stochastic gradient methods where the random samples are taken along the trajectory of a Markov chain. This setting is well-motivated when obtaining independent samples is costly or impossible, which excludes the use of the traditional stochastic gradient algorithms. Specifically, we consider the first- and zeroth-order versions of decentralized Markov chain gradient descent over a connected network, where each node only communicates with its neighbors about intermediate results. The nonergodic convergence and the ergodic convergence rate of the proposed algorithms have been rigorously established, and their critical dependences on the network topology and the mixing time of Markov chain have been highlighted. The numerical tests further validate the sample efficiency of our algorithm.
A Time-Dependent TSP Formulation for the Design of an Active Debris Removal Mission using Simulated Annealing
Federici, Lorenzo, Zavoli, Alessandro, Colasurdo, Guido
This paper proposes a formulation of the Active Debris Removal (ADR) Mission Design problem as a modified Time-Dependent Traveling Salesman Problem (TDTSP). The TDTSP is a well-known combinatorial optimization problem, whose solution is the cheapest mono-cyclic tour connecting a number of non-stationary cities in a map. The problem is tackled with an optimization procedure based on Simulated Annealing, that efficiently exploits a natural encoding and a careful choice of mutation operators. The developed algorithm is used to simultaneously optimize the targets sequence and the rendezvous epochs of an impulsive ADR mission. Numerical results are presented for sets comprising up to 20 targets. INTRODUCTION The Traveling Salesman Problem (TSP) is a well-known combinatorial optimization problem, whose solution is the cheapest tour which allows a salesman to visit, only once, a number of cities in a map; the cost of each city-to-city transfer is, typically, the traveled distance or the fuel consumption. Active Debris Removal (ADR) missions can be seen as peculiar instances of the TDTSP, where an active (chaser) spacecraft is asked to visit, that is, to perform a rendezvous, with a certain number of targets (space debris), making the best use of the on-board propellant. Such kind of missions are increasing in popularity among space agencies all over the world, as the sustainability of the extra-atmospheric environment is becoming compromised by the huge amount of "space garbage" now orbiting Earth. A cost-competitive space program would involve the removal of several dozens of small debris with each single mission; such a complex scenario could became feasible only with the best possible use of the propellant on-board of the chaser spacecraft. As a consequence, a well-designed ADR mission would require the optimization of a multi-target rendezvous trajectory. A number of authors dealt with long term or time-free ADR missions aimed at removing a small number of debris from Sun synchronous orbits (at a rate of three to ten per year). These missions heavily rely on J 2 orbital perturbation for the alignment of the orbital planes of consecutive targets before starting the rendezvous maneuver, in order to reduce the mission cost.
Conservative set valued fields, automatic differentiation, stochastic gradient method and deep learning
Bolte, Jรฉrรดme, Pauwels, Edouard
The Clarke subdifferential is not suited to tackle nonsmooth deep learning issues: backpropagation, mini-batches and steady states are not properly modelled. As a remedy, we introduce set valued conservative fields as surrogates to standard subdifferential mappings. We study their properties and provide elements of a calculus. Functions having a conservative field are called path differentiable. Convex/concave, semi-algebraic, or Clarke regular Lipschitz continuous functions are path differentiable as their corresponding subdifferentials are conservative. Another concrete and considerable class of examples of conservative fields, which are not subdifferential mappings, is given by the automatic differentiation oracle, as for instance the "subgradients" provided by the backpropagation algorithm in deep learning. Our differential model is eventually used to ensure subsequential convergence for nonsmooth stochastic gradient methods in the tame Lipschitz continuous setting offering the possibility of using mini-batches, the actual backpropagation oracle and $o(1/\log k)$ stepsizes.
#003B Gradient Descent Master Data Science
A term "Deep learning" refers to training neural networks and sometimes very large neural networks. They are massively used for problems of classification and regression. The main goal is to optimize parameters \( w \) and \( b \) ( both in logistic regression and in neural networks). Gradient Descent achieved amazing results in solving these optimization tasks. Not so new, developed in the 1940s, 1950s and 1960s.
A generalization of regularized dual averaging and its dynamics
Excessive computational cost for learning large data and streaming data can be alleviated by using stochastic algorithms, such as stochastic gradient descent and its variants. Recent advances improve stochastic algorithms on convergence speed, adaptivity and structural awareness. However, distributional aspects of these new algorithms are poorly understood, especially for structured parameters. To develop statistical inference in this case, we propose a class of generalized regularized dual averaging (gRDA) algorithms with constant step size, which improves RDA (Xiao, 2010; Flammarion and Bach, 2017). Weak convergence of gRDA trajectories are studied, and as a consequence, for the first time in the literature, the asymptotic distributions for online l1 penalized problems become available. These general results apply to both convex and non-convex differentiable loss functions, and in particular, recover the existing regret bound for convex losses (Nemirovski et al., 2009). As important applications, statistical inferential theory on online sparse linear regression and online sparse principal component analysis are developed, and are supported by extensive numerical analysis. Interestingly, when gRDA is properly tuned, support recovery and central limiting distribution (with mean zero) hold simultaneously in the online setting, which is in contrast with the biased central limiting distribution of batch Lasso (Knight and Fu, 2000). Technical devices, including weak convergence of stochastic mirror descent, are developed as by-products with independent interest. Preliminary empirical analysis of modern image data shows that learning very sparse deep neural networks by gRDA does not necessarily sacrifice testing accuracy.
Using Statistics to Automate Stochastic Optimization
Lang, Hunter, Zhang, Pengchuan, Xiao, Lin
Despite the development of numerous adaptive optimizers, tuning the learning rate of stochastic gradient methods remains a major roadblock to obtaining good practical performance in machine learning. Rather than changing the learning rate at each iteration, we propose an approach that automates the most common hand-tuning heuristic: use a constant learning rate until "progress stops," then drop. We design an explicit statistical test that determines when the dynamics of stochastic gradient descent reach a stationary distribution. This test can be performed easily during training, and when it fires, we decrease the learning rate by a constant multiplicative factor. Our experiments on several deep learning tasks demonstrate that this statistical adaptive stochastic approximation (SASA) method can automatically find good learning rate schedules and match the performance of hand-tuned methods using default settings of its parameters. The statistical testing helps to control the variance of this procedure and improves its robustness.
Trivializations for Gradient-Based Optimization on Manifolds
We introduce a framework to study the transformation of problems with manifold constraints into unconstrained problems through parametrizations in terms of a Euclidean space. We call these parametrizations "trivializations". We prove conditions under which a trivialization is sound in the context of gradient-based optimization and we show how two large families of trivializations have overall favorable properties, but also suffer from a performance issue. We then introduce "dynamic trivializations", which solve this problem, and we show how these form a family of optimization methods that lie between trivializations and Riemannian gradient descent, and combine the benefits of both of them. We then show how to implement these two families of trivializations in practice for different matrix manifolds. To this end, we prove a formula for the gradient of the exponential of matrices, which can be of practical interest on its own. Finally, we show how dynamic trivializations improve the performance of existing methods on standard tasks designed to test long-term memory within neural networks.
Sample Efficient Policy Gradient Methods with Recursive Variance Reduction
Xu, Pan, Gao, Felicia, Gu, Quanquan
Improving the sample efficiency in reinforcement learning has been a long-standing research problem. In this work, we aim to reduce the sample complexity of existing policy gradient methods. We propose a novel policy gradient algorithm called SRVR-PG, which only requires $O(1/\epsilon^{3/2})$ episodes to find an $\epsilon$-approximate stationary point of the nonconcave performance function $J(\boldsymbol{\theta})$ (i.e., $\boldsymbol{\theta}$ such that $\|\nabla J(\boldsymbol{\theta})\|_2^2\leq\epsilon$). This sample complexity improves the best known result $O(1/\epsilon^{5/3})$ for policy gradient algorithms by a factor of $O(1/\epsilon^{1/6})$. In addition, we also propose a variant of SRVR-PG with parameter exploration, which explores the initial policy parameter from a prior probability distribution. We conduct numerical experiments on classic control problems in reinforcement learning to validate the performance of our proposed algorithms.
Communication-Efficient Distributed Learning via Lazily Aggregated Quantized Gradients
Sun, Jun, Chen, Tianyi, Giannakis, Georgios B., Yang, Zaiyue
The present paper develops a novel aggregated gradient approach for distributed machine learning that adaptively compresses the gradient communication. The key idea is to first quantize the computed gradients, and then skip less informative quantized gradient communications by reusing outdated gradients. Quantizing and skipping result in `lazy' worker-server communications, which justifies the term Lazily Aggregated Quantized gradient that is henceforth abbreviated as LAQ. Our LAQ can provably attain the same linear convergence rate as the gradient descent in the strongly convex case, while effecting major savings in the communication overhead both in transmitted bits as well as in communication rounds. Empirically, experiments with real data corroborate a significant communication reduction compared to existing gradient- and stochastic gradient-based algorithms.