AITopics | Gradient Descent

Collaborating Authors

Gradient Descent

News Overviews Instructional Materials AI-Alerts Classics

Gradient Descent Learns Linear Dynamical Systems

Hardt, Moritz, Ma, Tengyu, Recht, Benjamin

arXiv.org Machine LearningSep-16-2016

For a sequence model to be both expressive and parsimonious in its parameterization, it is crucial to equip the model with memory thus allowing its prediction at time t to depend on previously seen inputs. Recurrent neural networks form an expressive class of nonlinear sequence models. Through their many variants, such as long-short-term-memory [HS97], recurrent neural networks have seen remarkable empirical success in a broad range of domains. At the core, neural networks are typically trained using some form of (stochastic) gradient descent. Even though the training objective is non-convex, it is widely observed in practice that gradient descent quickly approaches a good set of model parameters.

artificial intelligence, assumption, machine learning, (17 more...)

arXiv.org Machine Learning

1609.05191

Country: North America (0.28)

Genre: Research Report (1.00)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

Relativistic Monte Carlo

Lu, Xiaoyu, Perrone, Valerio, Hasenclever, Leonard, Teh, Yee Whye, Vollmer, Sebastian J.

arXiv.org Machine LearningSep-14-2016

Hamiltonian Monte Carlo (HMC) is a popular Markov chain Monte Carlo (MCMC) algorithm that generates proposals for a Metropolis-Hastings algorithm by simulating the dynamics of a Hamiltonian system. However, HMC is sensitive to large time discretizations and performs poorly if there is a mismatch between the spatial geometry of the target distribution and the scales of the momentum distribution. In particular the mass matrix of HMC is hard to tune well. In order to alleviate these problems we propose relativistic Hamiltonian Monte Carlo, a version of HMC based on relativistic dynamics that introduce a maximum velocity on particles. We also derive stochastic gradient versions of the algorithm and show that the resulting algorithms bear interesting relationships to gradient clipping, RMSprop, Adagrad and Adam, popular optimisation methods in deep learning. Based on this, we develop relativistic stochastic gradient descent by taking the zero-temperature limit of relativistic stochastic gradient Hamiltonian Monte Carlo. In experiments we show that the relativistic algorithms perform better than classical Newtonian variants and Adam.

algorithm, artificial intelligence, machine learning, (16 more...)

arXiv.org Machine Learning

1609.04388

Country: Europe > United Kingdom > England (0.14)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.86)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.67)

Add feedback

Variance-Reduced Proximal Stochastic Gradient Descent for Non-convex Composite optimization

Yu, Xiyu, Tao, Dacheng

arXiv.org Machine LearningSep-11-2016

Here we study non-convex composite optimization: first, a finite-sum of smooth but non-convex functions, and second, a general function that admits a simple proximal mapping. Most research on stochastic methods for composite optimization assumes convexity or strong convexity of each function. In this paper, we extend this problem into the non-convex setting using variance reduction techniques, such as prox-SVRG and prox-SAGA. We prove that, with a constant step size, both prox-SVRG and prox-SAGA are suitable for non-convex composite optimization, and help the problem converge to a stationary point within $O(1/\epsilon)$ iterations. That is similar to the convergence rate seen with the state-of-the-art RSAG method and faster than stochastic gradient descent. Our analysis is also extended into the min-batch setting, which linearly accelerates the convergence. To the best of our knowledge, this is the first analysis of convergence rate of variance-reduced proximal stochastic gradient for non-convex composite optimization.

artificial intelligence, machine learning, variance-reduced proximal stochastic gradient descent, (1 more...)

arXiv.org Machine Learning

1606.00602

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Add feedback

Data Dependent Convergence for Distributed Stochastic Optimization

Bijral, Avleen S.

arXiv.org Machine LearningAug-30-2016

In this dissertation we propose alternative analysis of distributed stochastic gradient descent (SGD) algorithms that rely on spectral properties of the data covariance. As a consequence we can relate questions pertaining to speedups and convergence rates for distributed SGD to the data distribution instead of the regularity properties of the objective functions. More precisely we show that this rate depends on the spectral norm of the sample covariance matrix. An estimate of this norm can provide practitioners with guidance towards a potential gain in algorithm performance. For example many sparse datasets with low spectral norm prove to be amenable to gains in distributed settings. Towards establishing this data dependence we first study a distributed consensus-based SGD algorithm and show that the rate of convergence involves the spectral norm of the sample covariance matrix when the underlying data is assumed to be independent and identically distributed (homogenous). This dependence allows us to identify network regimes that prove to be beneficial for datasets with low sample covariance spectral norm. Existing consensus based analyses prove to be sub-optimal in the homogenous setting. Our analysis method also allows us to find data-dependent convergence rates as we limit the amount of communication. Spreading a fixed amount of data across more nodes slows convergence; in the asymptotic regime we show that adding more machines can help when minimizing twice-differentiable losses. Since the mini-batch results don't follow from the consensus results we propose a different data dependent analysis thereby providing theoretical validation for why certain datasets are more amenable to mini-batching. We also provide empirical evidence for results in this thesis.

algorithm, dataset, matrix, (16 more...)

arXiv.org Machine Learning

1608.08337

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.14)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
North America > United States > Illinois > Cook County > Chicago (0.04)

Genre: Research Report > New Finding (0.93)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)

Add feedback

Incremental Nonlinear System Identification and Adaptive Particle Filtering Using Gaussian Process

Bastani, Vahid, Marcenaro, Lucio, Regazzoni, Carlo

arXiv.org Machine LearningAug-30-2016

An incremental/online state dynamic learning method is proposed for identification of the nonlinear Gaussian state space models. The method embeds the stochastic variational sparse Gaussian process as the probabilistic state dynamic model inside a particle filter framework. Model updating is done at measurement sample rate using stochastic gradient descent based optimization implemented in the state estimation filtering loop. The performance of the proposed method is compared with state-of-the-art Gaussian process based batch learning methods. Finally, it is shown that the state estimation performance significantly improves due to the online learning of state dynamics.

artificial intelligence, dynamic model, machine learning, (12 more...)

arXiv.org Machine Learning

1608.08362

Country: North America > United States (0.46)

Genre: Research Report (0.40)

Industry: Education (0.35)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)

Add feedback

Gentlest Introduction to Tensorflow (Part 2)

#artificialintelligenceAug-26-2016, 15:40:31 GMT

Summary: We show in illustrations how the machine learning'training' process happens in Tensorflow, and tie them back to the Tensorflow code. This paves the way for discussing'training' variations, namely stochastic/mini-batch/batch, and adaptive learning rate gradient descent. The'training' variation code snippets presented serve to reinforce the understanding of the role of Tensorflow placeholders. In the previous article, we used Tensorflow (TF) to build and learn a linear regression model with a single feature so that given a feature value (house size/sqm), we can predict the outcome (house price/). In machine learning (ML) literature, we come across the term'training' very often, let us literally look at what that means in TF.

artificial intelligence, datapoint, machine learning, (16 more...)

#artificialintelligence

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.57)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.47)

Add feedback

Understanding the Energy and Precision Requirements for Online Learning

Sakr, Charbel, Patil, Ameya, Zhang, Sai, Kim, Yongjune, Shanbhag, Naresh

arXiv.org Machine LearningAug-26-2016

It is well-known that the precision of data, hyperparameters, and internal representations employed in learning systems directly impacts its energy, throughput, and latency. The precision requirements for the training algorithm are also important for systems that learn on-the-fly. Prior work has shown that the data and hyperparameters can be quantized heavily without incurring much penalty in classification accuracy when compared to floating point implementations. These works suffer from two key limitations. First, they assume uniform precision for the classifier and for the training algorithm and thus miss out on the opportunity to further reduce precision. Second, prior works are empirical studies. In this article, we overcome both these limitations by deriving analytical lower bounds on the precision requirements of the commonly employed stochastic gradient descent (SGD) on-line learning algorithm in the specific context of a support vector machine (SVM). Lower bounds on the data precision are derived in terms of the the desired classification accuracy and precision of the hyperparameters used in the classifier. Additionally, lower bounds on the hyperparameter precision in the SGD training algorithm are obtained. These bounds are validated using both synthetic and the UCI breast cancer dataset. Additionally, the impact of these precisions on the energy consumption of a fixed-point SVM with on-line training is studied.

artificial intelligence, energy and precision requirement, machine learning, (1 more...)

arXiv.org Machine Learning

1607.00669

Genre: Research Report (0.40)

Industry: Education > Educational Setting > Online (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.53)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.53)

Add feedback

Variance Reduction for Faster Non-Convex Optimization

Allen-Zhu, Zeyuan, Hazan, Elad

arXiv.org Machine LearningAug-24-2016

We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.

artificial intelligence, iteration, machine learning, (17 more...)

arXiv.org Machine Learning

1603.05643

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

Single-Layer Neural Networks and Gradient Descent

#artificialintelligenceAug-22-2016, 03:10:03 GMT

This article offers a brief glimpse of the history and basic concepts of machine learning. We will take a look at the first algorithmically described neural network and the gradient descent algorithm in context of adaptive linear neurons, which will not only introduce the principles of machine learning but also serve as the basis for modern multilayer neural networks in future articles. Machine learning is one of the hottest and most exciting fields in the modern age of technology. Thanks to machine learning, we enjoy robust email spam filters, convenient text and voice recognition, reliable web search engines, challenging chess players, and, hopefully soon, safe and efficient self-driving cars. Without any doubt, machine learning has become a big and popular field, and sometimes it may be challenging to see the (random) forest for the (decision) trees.

algorithm, artificial intelligence, machine learning, (14 more...)

#artificialintelligence

Country: North America > United States > California > Santa Clara County > Stanford (0.04)

Industry: Leisure & Entertainment > Games > Chess (0.54)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.80)

Add feedback

Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm

Liu, Qiang, Wang, Dilin

arXiv.org Machine LearningAug-19-2016

We propose a general purpose variational inference algorithm that forms a natural counterpart of gradient descent for optimization. Our method iteratively transports a set of particles to match the target distribution, by applying a form of functional gradient descent that minimizes the KL divergence. Empirical studies are performed on various real world models and datasets, on which our method is competitive with existing state-of-the-art methods. The derivation of our method is based on a new theoretical result that connects the derivative of KL divergence under smooth transforms with Stein's identity and a recently proposed kernelized Stein discrepancy, which is of independent interest.

artificial intelligence, machine learning, particle, (14 more...)

arXiv.org Machine Learning

1608.04471

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.92)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.65)

Add feedback