AITopics | Gradient Descent

Gradient Descent

Convergence and sample complexity of gradient methods for the model-free linear quadratic regulator problem

Mohammadi, Hesameddin, Zare, Armin, Soltanolkotabi, Mahdi, Jovanović, Mihailo R.

arXiv.org Artificial Intelligence Dec-26-2019

Model-free reinforcement learning attempts to find an optimal control action for an unknown dynamical system by directly searching over the parameter space of controllers. The convergence behavior and statistical properties of these approaches are often poorly understood because of the nonconvex nature of the underlying optimization problems as well as the lack of exact gradient computation. In this paper, we take a step towards demystifying the performance and efficiency of such methods by focusing on the standard infinite-horizon linear quadratic regulator problem for continuous-time systems with unknown state-space parameters. We establish exponential stability for the ordinary differential equation (ODE) that governs the gradient-flow dynamics over the set of stabilizing feedback gains and show that a similar result holds for the gradient descent method that arises from the forward Euler discretization of the corresponding ODE. We also provide theoretical bounds on the convergence rate and sample complexity of a random search method. Our results demonstrate that the required simulation time for achieving $\epsilon$-accuracy in a model-free setup and the total number of function evaluations both scale as $\log \, (1/\epsilon)$.
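
The link between gradient flow and gradient descent mentioned above can be illustrated on a toy problem. The sketch below does not use the LQR setting of the paper; it assumes the scalar objective f(x) = x²/2, chosen only for illustration. Its gradient flow dx/dt = -x has the closed-form solution x(t) = x0·exp(-t), and a forward Euler step of size `alpha` gives the familiar update x ← x - alpha·x. The names `gradient_descent`, `gradient_flow`, `alpha`, and `x0` are hypothetical.

```python
import math


def grad(x):
    """Gradient of the toy objective f(x) = 0.5 * x**2 (an assumed example)."""
    return x


def gradient_descent(x0, alpha, steps):
    """Forward Euler discretization of the gradient flow dx/dt = -grad(x)."""
    x = x0
    trajectory = [x]
    for _ in range(steps):
        x = x - alpha * grad(x)  # one Euler step of size alpha
        trajectory.append(x)
    return trajectory


def gradient_flow(x0, t):
    """Closed-form solution of dx/dt = -x for the toy objective."""
    return x0 * math.exp(-t)


# 50 steps of size 0.1 cover the same "time" t = 5 as the continuous flow.
traj = gradient_descent(x0=1.0, alpha=0.1, steps=50)
```

In this toy case both the flow and its discretization decay geometrically, so the iterate after k steps equals (1 - alpha)^k · x0 and stays close to the continuous solution at time t = k·alpha when `alpha` is small.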

inequality, lqr problem, matrix, (15 more...)

arXiv.org Artificial Intelligence

1912.11899

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.28)
North America > United States > California > San Francisco County > San Francisco (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
(4 more...)

Genre: Research Report > New Finding (0.54)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.87)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

Quadruply Stochastic Gradient Method for Large Scale Nonlinear Semi-Supervised Ordinal Regression AUC Optimization

Shi, Wanli, Gu, Bin, Li, Xiang, Huang, Heng

arXiv.org Machine Learning Dec-23-2019

Semi-supervised ordinal regression (S$^2$OR) problems are ubiquitous in real-world applications, where only a few ordered instances are labeled and massive numbers of instances remain unlabeled. Recent research has shown that directly optimizing the concordance index or AUC can impose a better ranking on the data than optimizing the traditional error rate in ordinal regression (OR) problems. In this paper, we propose an unbiased objective function for S$^2$OR AUC optimization based on the ordinal binary decomposition approach. In addition, to handle large-scale kernelized learning problems, we propose a scalable algorithm called QS$^3$ORAO using the doubly stochastic gradients (DSG) framework for functional optimization. Theoretically, we prove that our method converges to the optimal solution at the rate of $O(1/t)$, where $t$ is the number of iterations for stochastic data sampling. Extensive experimental results on various benchmark and real-world datasets also demonstrate that our method is efficient and effective while retaining similar generalization performance.
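
For readers unfamiliar with the quantity being optimized, the following is a minimal sketch of the empirical AUC for a binary problem, computed as the fraction of positive-negative pairs that a scoring function ranks correctly (ties count one half). This is the standard pairwise definition, not the S$^2$OR objective or the QS$^3$ORAO algorithm from the paper; the function name `pairwise_auc` and the sample scores are hypothetical.

```python
def pairwise_auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    total = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                total += 1.0
            elif p == n:
                total += 0.5
    return total / (len(pos_scores) * len(neg_scores))


# Made-up scores: five of the six pairs are ordered correctly.
auc = pairwise_auc([0.9, 0.8, 0.4], [0.7, 0.3])
```

The double loop costs O(mn) for m positives and n negatives, which hints at why pairwise objectives of this kind call for stochastic sampling at scale.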

algorithm, optimization, ordinal regression, (16 more...)

arXiv.org Machine Learning

1912.11193

Country:

North America > United States (0.04)
North America > Canada > Ontario (0.04)
Asia > China > Jiangsu Province > Nanjing (0.04)

Genre: Research Report (0.50)

Industry: Education (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.61)

Direct and indirect reinforcement learning

Guan, Yang, Li, Shengbo Eben, Duan, Jingliang, Li, Jie, Ren, Yangang, Cheng, Bo

arXiv.org Artificial Intelligence Dec-22-2019

Reinforcement learning (RL) algorithms have been successfully applied to a range of challenging sequential decision making and control tasks. In this paper, we classify RL into direct and indirect methods according to how they seek the optimal policy of the Markov Decision Process (MDP) problem. The former finds the optimal policy by directly maximizing an objective function using a gradient descent method, in which the objective function is usually the expectation of accumulated future rewards. The latter indirectly finds the optimal policy by solving the Bellman equation, which is the sufficient and necessary condition from Bellman's principle of optimality. We take vanilla policy gradient and approximate policy iteration to study their internal relationship, and reveal that both direct and indirect methods can be unified in an actor-critic architecture and are equivalent if we always choose the stationary state distribution of the current policy as the initial state distribution of the MDP. Finally, we classify the current mainstream RL algorithms and compare the differences between other criteria, including value-based and policy-based, model-based and model-free.
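
In standard textbook notation (the symbols below are assumptions, since the abstract does not fix them), the two viewpoints can be written side by side. Direct methods follow the gradient of the expected discounted return of a parameterized policy $\pi_\theta$, which is the same as gradient descent on $-J(\theta)$:

$$
J(\theta) = \mathbb{E}_{\pi_\theta}\Big[\sum_{t=0}^{\infty} \gamma^t r_t\Big], \qquad \theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)
$$

Indirect methods instead solve the Bellman optimality equation for the optimal value function $v^*$ and read the policy off from it:

$$
v^*(s) = \max_a \Big[ r(s,a) + \gamma \sum_{s'} p(s' \mid s, a)\, v^*(s') \Big]
$$

Here $\gamma \in [0,1)$ is a discount factor, $\alpha$ a step size, $r$ the reward, and $p$ the transition model.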

algorithm, gradient, state distribution, (12 more...)

arXiv.org Artificial Intelligence

1912.106

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
Asia > Middle East > Jordan (0.04)
Asia > China > Beijing > Beijing (0.04)
North America > United States > Massachusetts > Middlesex County > Belmont (0.04)

Genre: Research Report (0.40)

Industry: Leisure & Entertainment > Games > Computer Games (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.48)

Landscape Connectivity and Dropout Stability of SGD Solutions for Over-parameterized Neural Networks

Shevchenko, Alexander, Mondelli, Marco

arXiv.org Machine Learning Dec-20-2019

The optimization of multilayer neural networks typically leads to a solution with zero training error, yet the landscape can exhibit spurious local minima and the minima can be disconnected. In this paper, we shed light on this phenomenon: we show that the combination of stochastic gradient descent (SGD) and over-parameterization makes the landscape of multilayer neural networks approximately connected and thus more favorable to optimization. More specifically, we prove that SGD solutions are connected via a piecewise linear path, and the increase in loss along this path vanishes as the number of neurons grows large. This result is a consequence of the fact that the parameters found by SGD are increasingly dropout stable as the network becomes wider. We show that, if we remove part of the neurons (and suitably rescale the remaining ones), the change in loss is independent of the total number of neurons, and it depends only on how many neurons are left. Our results exhibit a mild dependence on the input dimension: they are dimension-free for two-layer networks and depend linearly on the dimension for multilayer networks. We validate our theoretical findings with numerical experiments for different architectures and classification tasks.
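
The "remove part of the neurons and rescale" operation can be sketched for a two-layer network. The example below is only an illustration under stated assumptions: it uses a mean-field normalization (the output is the average over neurons, so averaging over the kept neurons is the rescaling), ReLU activations, and random weights rather than SGD solutions. The function names `two_layer` and `dropout_keep_first` are hypothetical, and the paper's exact construction may differ.

```python
import numpy as np


def two_layer(x, W, a):
    """Two-layer ReLU network with mean-field scaling: mean_i a_i * relu(w_i . x)."""
    hidden = np.maximum(W @ x, 0.0)
    return float(np.mean(a * hidden))


def dropout_keep_first(W, a, keep):
    """Keep the first `keep` neurons; the mean in `two_layer` then rescales by N/keep."""
    return W[:keep], a[:keep]


rng = np.random.default_rng(0)
N, d = 1000, 5  # number of neurons and input dimension (arbitrary choices)
W = rng.normal(size=(N, d))
a = rng.normal(size=N)
x = rng.normal(size=d)

full = two_layer(x, W, a)
W_half, a_half = dropout_keep_first(W, a, N // 2)
half = two_layer(x, W_half, a_half)
gap = abs(full - half)  # change in output after dropping half of the neurons
```

A network is dropout stable in this informal sense when `gap` (or the corresponding change in loss) stays small after such a removal.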

neural network, neuron, (14 more...)

arXiv.org Machine Learning

1912.10095

Country:

North America > United States (0.14)
Europe > Austria (0.04)
Africa > Middle East > Tunisia > Ben Arous Governorate > Ben Arous (0.04)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.68)

Second-order Information in First-order Optimization Methods

Hu, Yuzheng, Lin, Licong, Tang, Shange

arXiv.org Machine Learning Dec-20-2019

In this paper, we try to uncover the second-order essence of several first-order optimization methods. For Nesterov Accelerated Gradient, we rigorously prove that the algorithm makes use of the difference between past and current gradients, and thus approximates the Hessian and accelerates training. For adaptive methods, we relate Adam and Adagrad to a powerful technique in computational statistics, Natural Gradient Descent (NGD). These adaptive methods can in fact be treated as relaxations of NGD, with only a slight difference lying in the square root of the denominator in the update rules. Skeptical about the effect of this difference, we design a new algorithm, AdaSqrt, which removes the square root in the denominator and scales the learning rate by $\sqrt{T}$. Surprisingly, our new algorithm is comparable to various first-order methods (such as SGD and Adam) on MNIST and even beats Adam on CIFAR-10! This phenomenon casts doubt on the conventional view that the square root is crucial and that training without it will lead to terrible performance. As far as we are concerned, so long as the algorithm tries to explore second-order or even higher-order information of the loss surface, proper scaling of the learning rate alone will guarantee fast training and good generalization performance. To the best of our knowledge, this is the first paper that seriously considers the necessity of the square root among all adaptive methods. We believe that our work can shed light on the importance of higher-order information and inspire the design of more powerful algorithms in the future.
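
To make the "square root in the denominator" concrete, the sketch below contrasts a scalar Adagrad step with a hypothetical variant that drops the square root and multiplies the learning rate by the square root of the step index. This is one reading of the abstract, not the paper's AdaSqrt update: interpreting T as the current step index `t`, the scalar setting, and the names `adagrad_step` and `no_sqrt_step` are all assumptions.

```python
import math


def adagrad_step(x, g, accum, lr=0.1, eps=1e-8):
    """Standard scalar Adagrad: divide by the root of the accumulated squared gradients."""
    accum = accum + g * g
    x = x - lr * g / (math.sqrt(accum) + eps)
    return x, accum


def no_sqrt_step(x, g, accum, t, lr=0.1, eps=1e-8):
    """Hypothetical variant: no square root in the denominator, lr scaled by sqrt(t)."""
    accum = accum + g * g
    x = x - lr * math.sqrt(t) * g / (accum + eps)
    return x, accum


# One step from x = 1.0 with gradient g = 2.0 and an empty accumulator.
x_ada, acc_ada = adagrad_step(1.0, 2.0, 0.0)
x_var, acc_var = no_sqrt_step(1.0, 2.0, 0.0, t=1)
```

With a constant gradient the accumulator grows linearly in `t`, so the Adagrad step shrinks like 1/sqrt(t); the sqrt(t) factor in the variant gives the same decay rate once the square root is removed, which is one way to see why such a rescaling is natural.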

algorithm, gradient, information, (13 more...)

arXiv.org Machine Learning

1912.09926

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
Asia > China > Beijing > Beijing (0.04)

Genre: Research Report (0.82)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.38)

Pseudo-Encoded Stochastic Variational Inference

Zadeh, Amir, Hessner, Simon, Lim, Yao-Chong, Morency, Louis-Philippe

arXiv.org Machine Learning Dec-19-2019

Posterior inference in directed graphical models is commonly done using a probabilistic encoder (a.k.a. inference model) conditioned on the input. Often this inference model is trained jointly with the probabilistic decoder (a.k.a. generator model). If the probabilistic encoder encounters complexities during training (e.g., suboptimal complexity or parameterization), then learning reaches a suboptimal objective, a phenomenon commonly called inference suboptimality. In Variational Inference (VI), optimizing the ELBo using Stochastic Variational Inference (SVI) can eliminate the inference suboptimality (as demonstrated in this paper); however, this solution comes at a substantial computational cost when inference needs to be done on new data points. Essentially, a long sequential chain of gradient updates is required to fully optimize approximate posteriors. In this paper, we present an approach called Pseudo-Encoded Stochastic Variational Inference (PE-SVI) to reduce the inference complexity of SVI at test time. Our approach relies on finding a suitable initial start point for gradient operations, which naturally reduces the required number of gradient steps. Furthermore, this initialization allows for adopting larger step sizes (compared to the random initialization used in SVI), which further reduces the inference time complexity. PE-SVI reaches the same ELBo objective as SVI using less than one percent of the required steps, on average.

variational inference, encoder, svi, (12 more...)

arXiv.org Machine Learning

1912.09423

Country:

Asia > Middle East > Jordan (0.05)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
(2 more...)

Gradient-based training of Gaussian Mixture Models in High-Dimensional Spaces

Gepperth, Alexander, Pfülb, Benedikt

arXiv.org Machine Learning Dec-18-2019

We present an approach for efficiently training Gaussian Mixture Models (GMMs) with Stochastic Gradient Descent (SGD) on large amounts of high-dimensional data (e.g., images). In such a scenario, SGD is strongly superior in terms of execution time and memory usage, although it is conceptually more complex than the traditional Expectation-Maximization (EM) algorithm. To enable SGD training, we propose three novel ideas. First, we show that minimizing an upper bound to the GMM log likelihood instead of the full one is feasible and numerically much more stable in high-dimensional spaces. Second, we propose a new annealing procedure that prevents SGD from converging to pathological local minima. Third, we propose an SGD-compatible simplification of the full GMM model based on local principal directions, which avoids excessive memory use in high-dimensional spaces due to the quadratic growth of covariance matrices. Experiments on several standard image datasets show the validity of our approach, and we provide a publicly available TensorFlow implementation.

dataset, gmm, gmm model, (13 more...)

arXiv.org Machine Learning

1912.09379

Country:

Europe > Slovenia > Drava > Municipality of Benedikt > Benedikt (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
Europe > Portugal > Coimbra > Coimbra (0.04)
Europe > Germany (0.04)

Genre: Research Report > Promising Solution (0.48)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.69)

Incorporating Unlabeled Data into Distributionally Robust Learning

Frogner, Charlie, Claici, Sebastian, Chien, Edward, Solomon, Justin

arXiv.org Machine Learning Dec-17-2019

We study a robust alternative to empirical risk minimization called distributionally robust learning (DRL), in which one learns to perform against an adversary who can choose the data distribution from a specified set of distributions. We illustrate a problem with current DRL formulations, which rely on an overly broad definition of allowed distributions for the adversary, leading to learned classifiers that are unable to predict with any confidence. We propose a solution that incorporates unlabeled data into the DRL problem to further constrain the adversary. We show that this new formulation is tractable for stochastic gradient-based optimization and yields a computable guarantee on the future performance of the learned classifier, analogous to -- but tighter than -- guarantees from conventional DRL. We examine the performance of this new formulation on 14 real datasets and find that it often yields effective classifiers with nontrivial performance guarantees in situations where conventional DRL produces neither. Inspired by these results, we extend our DRL formulation to active learning with a novel, distributionally-robust version of the standard model-change heuristic. Our active learning algorithm often achieves superior learning performance to the original heuristic on real datasets.
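
The general shape of a distributionally robust learning problem can be written as a min-max program. The notation below is the generic textbook form and an assumption on our part, not the specific formulation of the paper: $\theta$ denotes the model parameters, $\ell$ a loss function, and $\mathcal{U}$ the set of data distributions the adversary may choose from.

$$
\min_{\theta} \; \sup_{Q \in \mathcal{U}} \; \mathbb{E}_{(x,y) \sim Q}\big[\ell(\theta; x, y)\big]
$$

In this form, the concern raised in the abstract is that an overly large $\mathcal{U}$ makes the inner supremum so pessimistic that the learned classifier carries no useful guarantee; shrinking $\mathcal{U}$ with extra information, such as unlabeled data, tightens the bound.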

data distribution, incorporating unlabeled data, learning, (14 more...)

arXiv.org Machine Learning

1912.07729

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
North America > United States > Wisconsin (0.04)
North America > United States > New York (0.04)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)

Genre: Research Report (1.00)

Industry: Information Technology (0.92)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

Active strict saddles in nonsmooth optimization

Davis, Damek, Drusvyatskiy, Dmitriy

arXiv.org Machine Learning Dec-15-2019

Nonconvex optimization techniques are increasingly playing a major role in modern signal processing, high-dimensional statistics, and machine learning. A driving theme, fully supported by empirical evidence, is that simple algorithms often work well in highly nonconvex and even nonsmooth settings. Gradient descent, for example, often finds points with small objective value, despite the existence of many highly suboptimal critical points. A growing body of literature provides one compelling explanation for this phenomenon. Namely, typical smooth objective functions provably satisfy the strict saddle property, meaning each critical point is either a local minimizer or has a direction of strictly negative curvature (e.g., [6, 29, 30, 62, 63]).
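
A small worked example, chosen here for illustration and not taken from the paper, shows what a strict saddle looks like in the smooth case. Take $f(x, y) = x^2 - y^2$. The origin is a critical point, and the Hessian there has a strictly negative eigenvalue:

$$
\nabla f(0,0) = \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \qquad \nabla^2 f(0,0) = \begin{pmatrix} 2 & 0 \\ 0 & -2 \end{pmatrix}
$$

Gradient descent with step size $\alpha \in (0, 1/2)$ then decouples into

$$
x_{k+1} = (1 - 2\alpha)\, x_k, \qquad y_{k+1} = (1 + 2\alpha)\, y_k ,
$$

so the $x$ coordinate contracts while any nonzero $y_0$ grows geometrically. Only initial points with $y_0 = 0$, a set of measure zero, converge to the saddle, which is the intuition behind why randomly initialized gradient descent tends to avoid strict saddles.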

active manifold, critical point, manifold, (13 more...)

arXiv.org Machine Learning

1912.07146

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > Washington > King County > Seattle (0.04)
North America > United States > New York > Tompkins County > Ithaca (0.04)
(3 more...)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

End-to-End Learning of Geometrical Shaping Maximizing Generalized Mutual Information

Gümüs, Kadir, Alvarado, Alex, Chen, Bin, Häger, Christian, Agrell, Erik

arXiv.org Artificial Intelligence Dec-11-2019

GMI-based end-to-end learning is shown to be highly nonconvex. We apply gradient descent initialized with Gray-labeled APSK constellations directly to the constellation coordinates. State-of-the-art constellations in 2D and 4D are found providing reach increases up to 26% w.r.t. QAM. Introduction: Signal shaping has recently received considerable attention in the literature and is now regarded as a key technique to improve throughput in high-speed fiberoptic systems. Shaping methods can be broadly categorized into probabilistic shaping (PS) and geometric shaping (GS), both having distinct advantages and disadvantages [1]-[3].

constellation, learning, optimization, (12 more...)

arXiv.org Artificial Intelligence

1912.05638

Country:

Europe > Netherlands > North Brabant > Eindhoven (0.05)
Asia > China > Anhui Province > Hefei (0.05)
Europe > Sweden (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.37)
