AITopics | Gradient Descent

For this purpose, we propose a stochastic procedure, the mirror descent, which performs gradient descent inthe dual space. The generated estimates are additionally averaged in a recursive fashion with specific weights. Mirror descent algorithms havebeen developed in different contexts and they are known to be particularly efficient in high dimensional problems. Moreover their implementation is adapted to the online setting. The main result of the paper is the upper bound on the convergence rate for the generalization error.

algorithm, artificial intelligence, machine learning, (13 more...)

Neural Information Processing Systems

Country: Europe > France (0.15)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.51)

Add feedback

Matrix Exponential Gradient Updates for On-line Learning and Bregman Projection

Tsuda, Koji, Rätsch, Gunnar, Warmuth, Manfred K. K.

Neural Information Processing SystemsDec-31-2005

We address the problem of learning a symmetric positive definite matrix. The central issue is to design parameter updates that preserve positive definiteness. Our updates are motivated with the von Neumann divergence. Rather than treating the most general case, we focus on two key applications that exemplify our methods: Online learning with a simple square loss and finding a symmetric positive definite matrix subject to symmetric linear constraints. The updates generalize the Exponentiated Gradient (EG) update and AdaBoost, respectively: the parameter is now a symmetric positive definite matrix of trace one instead of a probability vector (which in this context is a diagonal positive definite matrix with trace one). The generalized updates use matrix logarithms and exponentials to preserve positive definiteness. Most importantly, we show how the analysis of each algorithm generalizes to the non-diagonal case. We apply both new algorithms, called the Matrix Exponentiated Gradient (MEG) update and DefiniteBoost, to learn a kernel matrix from distance measurements.

divergence, matrix, positive definite matrix, (14 more...)

Neural Information Processing Systems

Country:

Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.14)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
(6 more...)

Genre: Instructional Material > Online (0.50)

Industry: Education > Educational Setting > Online (1.00)

Technology:

Information Technology > Enterprise Applications > Human Resources > Learning Management (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.51)

Add feedback

Matrix Exponential Gradient Updates for On-line Learning and Bregman Projection

Tsuda, Koji, Rätsch, Gunnar, Warmuth, Manfred K.

Neural Information Processing SystemsDec-31-2005

We address the problem of learning a symmetric positive definite matrix. The central issue is to design parameter updates that preserve positive definiteness. Our updates are motivated with the von Neumann divergence. Ratherthan treating the most general case, we focus on two key applications that exemplify our methods: Online learning with a simple square loss and finding a symmetric positive definite matrix subject to symmetric linear constraints. The updates generalize the Exponentiated Gradient (EG) update and AdaBoost, respectively: the parameter is now a symmetric positive definite matrix of trace one instead of a probability vector (which in this context is a diagonal positive definite matrix with trace one). The generalized updates use matrix logarithms and exponentials topreserve positive definiteness. Most importantly, we show how the analysis of each algorithm generalizes to the non-diagonal case. We apply both new algorithms, called the Matrix Exponentiated Gradient (MEG) update and DefiniteBoost, to learn a kernel matrix from distance measurements.

artificial intelligence, machine learning, matrix, (16 more...)

Neural Information Processing Systems

Country:

Europe > Germany (0.46)
North America > United States (0.28)

Genre: Instructional Material > Online (0.50)

Industry: Education > Educational Setting > Online (1.00)

Technology:

Information Technology > Enterprise Applications > Human Resources > Learning Management (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.51)

Add feedback

Learning Curves for Stochastic Gradient Descent in Linear Feedforward Networks

Werfel, Justin, Xie, Xiaohui, Seung, H. S.

Neural Information Processing SystemsDec-31-2004

Gradient-following learning methods can encounter problems of implementation in many applications, and stochastic variants are frequently used to overcome these difficulties. We derive quantitative learning curves for three online training methods used with a linear perceptron: direct gradient descent, node perturbation, and weight perturbation. The maximum learning rate for the stochastic methods scales inversely with the first power of the dimensionality of the noise injected into the system; with sufficiently small learning rate, all three methods give identical learning curves. These results suggest guidelines for when these stochastic methods will be limited in their utility, and considerations for architectures in which they will be effective.

algorithm, noise, perturbation, (16 more...)

Neural Information Processing Systems

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
North America > United States > California > San Mateo County > San Mateo (0.04)
North America > United States > New Jersey > Mercer County > Princeton (0.04)

Genre: Research Report > New Finding (0.34)

Industry: Education > Educational Setting > Online (0.69)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

Learning Curves for Stochastic Gradient Descent in Linear Feedforward Networks

Werfel, Justin, Xie, Xiaohui, Seung, H. S.

Neural Information Processing SystemsDec-31-2004

Gradient-following learning methods can encounter problems of implementation in many applications, and stochastic variants are frequently used to overcome these difficulties. We derive quantitative learning curves for three online training methods used with a linear perceptron: direct gradient descent, node perturbation, and weight perturbation. The maximum learning rate for the stochastic methods scales inversely with the first power of the dimensionality of the noise injected into the system; with sufficiently small learning rate, all three methods give identical learning curves. These results suggest guidelines for when these stochastic methods will be limited in their utility, and considerations for architectures in which they will be effective.

algorithm, noise, perturbation, (16 more...)

Neural Information Processing Systems

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
North America > United States > California > San Mateo County > San Mateo (0.04)
North America > United States > New Jersey > Mercer County > Princeton (0.04)

Genre: Research Report > New Finding (0.34)

Industry: Education > Educational Setting > Online (0.69)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

Learning Curves for Stochastic Gradient Descent in Linear Feedforward Networks

Werfel, Justin, Xie, Xiaohui, Seung, H. S.

Neural Information Processing SystemsDec-31-2004

Gradient-following learning methods can encounter problems of implementation inmany applications, and stochastic variants are frequently used to overcome these difficulties. We derive quantitative learning curves for three online training methods used with a linear perceptron: direct gradient descent, node perturbation, and weight perturbation. The maximum learning rate for the stochastic methods scales inversely with the first power of the dimensionality of the noise injected into the system; withsufficiently small learning rate, all three methods give identical learning curves. These results suggest guidelines for when these stochastic methods will be limited in their utility, and considerations for architectures in which they will be effective.

artificial intelligence, machine learning, perturbation, (18 more...)

Neural Information Processing Systems

Country: North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)

Genre: Research Report > New Finding (0.34)

Industry: Education > Educational Setting > Online (0.69)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

A Natural Policy Gradient

Kakade, Sham M.

Neural Information Processing SystemsDec-31-2002

We provide a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space. Although gradient methods cannot make large changes in the values of the parameters, we show that the natural gradient is moving toward choosing a greedy optimal action rather than just a better action. These greedy optimal actions are those that would be chosen under one improvement step of policy iteration with approximate, compatible value functions, as defined by Sutton et al. [9]. We then show drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris. 1 Introduction There has been a growing interest in direct policy-gradient methods for approximate planning in large Markov decision problems (MDPs). Unfortunately, the standard gradient descent rule is noncovariant.

gradient, gradient method, natural gradient, (14 more...)

Neural Information Processing Systems

Country:

North America > United States > Massachusetts (0.04)
Europe > United Kingdom > England > Greater London > London (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > Middle East > Jordan (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.35)

Add feedback

A Natural Policy Gradient

Kakade, Sham M.

Neural Information Processing SystemsDec-31-2002

We provide a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space. Although gradient methods cannot make large changes in the values of the parameters, we show that the natural gradient is moving toward choosing a greedy optimal action rather than just a better action. These greedy optimal actions are those that would be chosen under one improvement step of policy iteration with approximate, compatible value functions, as defined by Sutton et al. [9]. We then show drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris. 1 Introduction There has been a growing interest in direct policy-gradient methods for approximate planning in large Markov decision problems (MDPs). Unfortunately, the standard gradient descent rule is noncovariant.

gradient, gradient method, natural gradient, (14 more...)

Neural Information Processing Systems

Country:

North America > United States > Massachusetts (0.04)
Europe > United Kingdom > England > Greater London > London (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > Middle East > Jordan (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.35)

Add feedback

A Natural Policy Gradient

Kakade, Sham M.

Neural Information Processing SystemsDec-31-2002

Sham Kakade Gatsby Computational Neuroscience Unit 17 Queen Square, London, UK WC1N 3AR http://www.gatsby.ucl.ac.uk sham@gatsby.ucl.ac.uk Abstract We provide a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space.Although gradient methods cannot make large changes in the values of the parameters, we show that the natural gradient ismoving toward choosing a greedy optimal action rather than just a better action. These greedy optimal actions are those that would be chosen under one improvement step of policy iteration with approximate, compatible value functions, as defined by Sutton etal. We then show drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris. 1 Introduction There has been a growing interest in direct policy-gradient methods for approximate planning in large Markov decision problems (MDPs). Unfortunately, the standard gradient descent rule is noncovariant. In this paper, we present a covariant gradient by defining a metric based on the underlying structure of the policy.

artificial intelligence, gradient, machine learning, (16 more...)

Neural Information Processing Systems

Country: Europe > United Kingdom > England > Greater London > London (0.24)

Industry: Health & Medicine > Therapeutic Area > Neurology (0.54)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.35)

Add feedback

Algorithms for Non-negative Matrix Factorization

Lee, Daniel D., Seung, H. Sebastian

Neural Information Processing SystemsDec-31-2001

Nonnegative matrix factorization (NMF) has previously been shown to be a useful decomposition for multivariate data. Two different multiplicative algorithmsfor NMF are analyzed. They differ only slightly in the multiplicative factor used in the update rules. One algorithm can be shown to minimize the conventional least squares error while the other minimizes the generalized Kullback-Leibler divergence. The monotonic convergence of both algorithms can be proven using an auxiliary function analogousto that used for proving convergence of the Expectation Maximization algorithm. The algorithms can also be interpreted as diagonally rescaledgradient descent, where the rescaling factor is optimally chosen to ensure convergence.

algorithm, matrix factorization, update rule, (16 more...)

Neural Information Processing Systems

Country: