AITopics | Gradient Descent

Collaborating Authors

Gradient Descent

News Overviews Instructional Materials AI-Alerts Classics

Necessary and Sufficient Conditions for Adaptive, Mirror, and Standard Gradient Methods

arXiv.org Machine LearningSep-23-2019

We study the impact of the constraint set and gradient geometry on the convergence of online and stochastic methods for convex optimization, providing a characterization of the geometries for which stochastic gradient and adaptive gradient methods are (minimax) optimal. In particular, we show that when the constraint set is quadratically convex, diagonally pre-conditioned stochastic gradient methods are minimax optimal. We further provide a converse that shows that when the constraints are not quadratically convex---for example, any $\ell_p$-ball for $p < 2$---the methods are far from optimal. Based on this, we can provide concrete recommendations for when one should use adaptive, mirror or stochastic gradient methods.

convex, gradient method, quadratically convex, (15 more...)

arXiv.org Machine Learning

1909.10455

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
North America > United States > New York (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
(2 more...)

Genre: Research Report (0.64)

Industry: Education (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.89)

Add feedback

Decentralized Markov Chain Gradient Descent

Sun, Tao, Chen, Tianyi, Sun, Yuejiao, Liao, Qing, Li, Dongsheng

arXiv.org Machine LearningSep-23-2019

Decentralized stochastic gradient method emerges as a promising solution for solving large-scale machine learning problems. This paper studies the decentralized Markov chain gradient descent (DMGD) algorithm - a variant of the decentralized stochastic gradient methods where the random samples are taken along the trajectory of a Markov chain. This setting is well-motivated when obtaining independent samples is costly or impossible, which excludes the use of the traditional stochastic gradient algorithms. Specifically, we consider the first- and zeroth-order versions of decentralized Markov chain gradient descent over a connected network, where each node only communicates with its neighbors about intermediate results. The nonergodic convergence and the ergodic convergence rate of the proposed algorithms have been rigorously established, and their critical dependences on the network topology and the mixing time of Markov chain have been highlighted. The numerical tests further validate the sample efficiency of our algorithm.

algorithm, markov chain, optimization, (15 more...)

arXiv.org Machine Learning

1909.10238

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.28)
Asia > Middle East > Jordan (0.04)
Asia > China > Guangdong Province > Shenzhen (0.04)
(3 more...)

Genre: Research Report > New Finding (0.46)

Industry: Education (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (1.00)

Add feedback

A Time-Dependent TSP Formulation for the Design of an Active Debris Removal Mission using Simulated Annealing

Federici, Lorenzo, Zavoli, Alessandro, Colasurdo, Guido

arXiv.org Artificial IntelligenceSep-23-2019

This paper proposes a formulation of the Active Debris Removal (ADR) Mission Design problem as a modified Time-Dependent Traveling Salesman Problem (TDTSP). The TDTSP is a well-known combinatorial optimization problem, whose solution is the cheapest mono-cyclic tour connecting a number of non-stationary cities in a map. The problem is tackled with an optimization procedure based on Simulated Annealing, that efficiently exploits a natural encoding and a careful choice of mutation operators. The developed algorithm is used to simultaneously optimize the targets sequence and the rendezvous epochs of an impulsive ADR mission. Numerical results are presented for sets comprising up to 20 targets. INTRODUCTION The Traveling Salesman Problem (TSP) is a well-known combinatorial optimization problem, whose solution is the cheapest tour which allows a salesman to visit, only once, a number of cities in a map; the cost of each city-to-city transfer is, typically, the traveled distance or the fuel consumption. Active Debris Removal (ADR) missions can be seen as peculiar instances of the TDTSP, where an active (chaser) spacecraft is asked to visit, that is, to perform a rendezvous, with a certain number of targets (space debris), making the best use of the on-board propellant. Such kind of missions are increasing in popularity among space agencies all over the world, as the sustainability of the extra-atmospheric environment is becoming compromised by the huge amount of "space garbage" now orbiting Earth. A cost-competitive space program would involve the removal of several dozens of small debris with each single mission; such a complex scenario could became feasible only with the best possible use of the propellant on-board of the chaser spacecraft. As a consequence, a well-designed ADR mission would require the optimization of a multi-target rendezvous trajectory. A number of authors dealt with long term or time-free ADR missions aimed at removing a small number of debris from Sun synchronous orbits (at a rate of three to ten per year). These missions heavily rely on J 2 orbital perturbation for the alignment of the orbital planes of consecutive targets before starting the rendezvous maneuver, in order to reduce the mission cost.

algorithm, formulation, optimization problem, (13 more...)

arXiv.org Artificial Intelligence

1909.10427

Country:

Europe > Italy > Lazio > Rome (0.04)
North America > United States > Massachusetts (0.04)

Genre: Research Report (0.82)

Industry:

Aerospace & Defense (0.68)
Transportation (0.68)
Energy (0.66)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Evolutionary Systems (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.61)

Add feedback

Conservative set valued fields, automatic differentiation, stochastic gradient method and deep learning

Bolte, Jérôme, Pauwels, Edouard

arXiv.org Artificial IntelligenceSep-23-2019

The Clarke subdifferential is not suited to tackle nonsmooth deep learning issues: backpropagation, mini-batches and steady states are not properly modelled. As a remedy, we introduce set valued conservative fields as surrogates to standard subdifferential mappings. We study their properties and provide elements of a calculus. Functions having a conservative field are called path differentiable. Convex/concave, semi-algebraic, or Clarke regular Lipschitz continuous functions are path differentiable as their corresponding subdifferentials are conservative. Another concrete and considerable class of examples of conservative fields, which are not subdifferential mappings, is given by the automatic differentiation oracle, as for instance the "subgradients" provided by the backpropagation algorithm in deep learning. Our differential model is eventually used to ensure subsequential convergence for nonsmooth stochastic gradient methods in the tame Lipschitz continuous setting offering the possibility of using mini-batches, the actual backpropagation oracle and $o(1/\log k)$ stepsizes.

automatic differentiation, conservative field, differentiation, (15 more...)

arXiv.org Artificial Intelligence

1909.103

Country:

Europe > France > Occitanie > Haute-Garonne > Toulouse (0.04)
North America > United States > Illinois (0.04)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.70)

Add feedback

#003B Gradient Descent Master Data Science

#artificialintelligenceSep-22-2019, 07:17:37 GMT

A term "Deep learning" refers to training neural networks and sometimes very large neural networks. They are massively used for problems of classification and regression. The main goal is to optimize parameters $ w $ and $ b $ ( both in logistic regression and in neural networks). Gradient Descent achieved amazing results in solving these optimization tasks. Not so new, developed in the 1940s, 1950s and 1960s.

gradient descent, gradient descent master data science, neural network, (4 more...)

#artificialintelligence

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.84)

Add feedback

A generalization of regularized dual averaging and its dynamics

Chao, Shih-Kang, Cheng, Guang

arXiv.org Machine LearningSep-22-2019

Excessive computational cost for learning large data and streaming data can be alleviated by using stochastic algorithms, such as stochastic gradient descent and its variants. Recent advances improve stochastic algorithms on convergence speed, adaptivity and structural awareness. However, distributional aspects of these new algorithms are poorly understood, especially for structured parameters. To develop statistical inference in this case, we propose a class of generalized regularized dual averaging (gRDA) algorithms with constant step size, which improves RDA (Xiao, 2010; Flammarion and Bach, 2017). Weak convergence of gRDA trajectories are studied, and as a consequence, for the first time in the literature, the asymptotic distributions for online l1 penalized problems become available. These general results apply to both convex and non-convex differentiable loss functions, and in particular, recover the existing regret bound for convex losses (Nemirovski et al., 2009). As important applications, statistical inferential theory on online sparse linear regression and online sparse principal component analysis are developed, and are supported by extensive numerical analysis. Interestingly, when gRDA is properly tuned, support recovery and central limiting distribution (with mean zero) hold simultaneously in the online setting, which is in contrast with the biased central limiting distribution of batch Lasso (Knight and Fu, 2000). Technical devices, including weak convergence of stochastic mirror descent, are developed as by-products with independent interest. Preliminary empirical analysis of modern image data shows that learning very sparse deep neural networks by gRDA does not necessarily sacrifice testing accuracy.

algorithm, convergence, theorem 3, (14 more...)

arXiv.org Machine Learning

1909.10072

Country:

Europe > Jersey (0.14)
North America > United States > Rhode Island > Providence County > Providence (0.14)
North America > United States > Missouri > Boone County > Columbia (0.14)
(12 more...)

Genre: Research Report > New Finding (0.45)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.54)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.34)

Add feedback

Using Statistics to Automate Stochastic Optimization

Lang, Hunter, Zhang, Pengchuan, Xiao, Lin

arXiv.org Artificial IntelligenceSep-21-2019

Despite the development of numerous adaptive optimizers, tuning the learning rate of stochastic gradient methods remains a major roadblock to obtaining good practical performance in machine learning. Rather than changing the learning rate at each iteration, we propose an approach that automates the most common hand-tuning heuristic: use a constant learning rate until "progress stops," then drop. We design an explicit statistical test that determines when the dynamics of stochastic gradient descent reach a stationary distribution. This test can be performed easily during training, and when it fires, we decrease the learning rate by a constant multiplicative factor. Our experiments on several deep learning tasks demonstrate that this statistical adaptive stochastic approximation (SASA) method can automatically find good learning rate schedules and match the performance of hand-tuned methods using default settings of its parameters. The statistical testing helps to control the variance of this procedure and improves its robustness.

estimator, experiment, statistics, (17 more...)

arXiv.org Artificial Intelligence

1909.09785

Country:

North America > Canada (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Russia (0.04)
(3 more...)

Genre: Research Report > Experimental Study (0.47)

Industry: Education > Educational Setting > Online (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.75)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)

Add feedback

Trivializations for Gradient-Based Optimization on Manifolds

Lezcano-Casado, Mario

arXiv.org Machine LearningSep-20-2019

We introduce a framework to study the transformation of problems with manifold constraints into unconstrained problems through parametrizations in terms of a Euclidean space. We call these parametrizations "trivializations". We prove conditions under which a trivialization is sound in the context of gradient-based optimization and we show how two large families of trivializations have overall favorable properties, but also suffer from a performance issue. We then introduce "dynamic trivializations", which solve this problem, and we show how these form a family of optimization methods that lie between trivializations and Riemannian gradient descent, and combine the benefits of both of them. We then show how to implement these two families of trivializations in practice for different matrix manifolds. To this end, we prove a formula for the gradient of the exponential of matrices, which can be of practical interest on its own. Finally, we show how dynamic trivializations improve the performance of existing methods on standard tasks designed to test long-term memory within neural networks.

lie group, manifold, trivialization, (15 more...)

arXiv.org Machine Learning

1909.09501

Country:

Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
North America > United States > Massachusetts > Suffolk County > Boston (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
(3 more...)

Genre: Research Report (0.63)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.35)

Add feedback

Sample Efficient Policy Gradient Methods with Recursive Variance Reduction

Xu, Pan, Gao, Felicia, Gu, Quanquan

arXiv.org Machine LearningSep-18-2019

Improving the sample efficiency in reinforcement learning has been a long-standing research problem. In this work, we aim to reduce the sample complexity of existing policy gradient methods. We propose a novel policy gradient algorithm called SRVR-PG, which only requires $O(1/\epsilon^{3/2})$ episodes to find an $\epsilon$-approximate stationary point of the nonconcave performance function $J(\boldsymbol{\theta})$ (i.e., $\boldsymbol{\theta}$ such that $\|\nabla J(\boldsymbol{\theta})\|_2^2\leq\epsilon$). This sample complexity improves the best known result $O(1/\epsilon^{5/3})$ for policy gradient algorithms by a factor of $O(1/\epsilon^{1/6})$. In addition, we also propose a variant of SRVR-PG with parameter exploration, which explores the initial policy parameter from a prior probability distribution. We conduct numerical experiments on classic control problems in reinforcement learning to validate the performance of our proposed algorithms.

artificial intelligence, machine learning, reinforcement learning, (16 more...)

arXiv.org Machine Learning

1909.0861

Country: North America > United States > California > Los Angeles County > Los Angeles (0.28)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.47)

Add feedback

Communication-Efficient Distributed Learning via Lazily Aggregated Quantized Gradients

Sun, Jun, Chen, Tianyi, Giannakis, Georgios B., Yang, Zaiyue

arXiv.org Machine LearningSep-17-2019

The present paper develops a novel aggregated gradient approach for distributed machine learning that adaptively compresses the gradient communication. The key idea is to first quantize the computed gradients, and then skip less informative quantized gradient communications by reusing outdated gradients. Quantizing and skipping result in `lazy' worker-server communications, which justifies the term Lazily Aggregated Quantized gradient that is henceforth abbreviated as LAQ. Our LAQ can provably attain the same linear convergence rate as the gradient descent in the strongly convex case, while effecting major savings in the communication overhead both in transmitted bits as well as in communication rounds. Empirically, experiments with real data corroborate a significant communication reduction compared to existing gradient- and stochastic gradient-based algorithms.

artificial intelligence, communication, machine learning, (17 more...)

arXiv.org Machine Learning

1909.07588

Country:

Europe (1.00)
Asia > China (0.68)
North America > United States > Minnesota (0.28)

Genre: Research Report (0.65)

Technology:

Information Technology > Communications > Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.69)

Add feedback