Gradient Descent
Generalization Error Bounds for Aggregation by Mirror Descent with Averaging
Juditsky, Anatoli, Nazin, Alexander, Tsybakov, Alexandre, Vayatis, Nicolas
For this purpose, we propose a stochastic procedure, the mirror descent, which performs gradient descent in the dual space. The generated estimates are additionally averaged in a recursive fashion with specific weights. Mirror descent algorithms have been developed in different contexts and they are known to be particularly efficient in high dimensional problems. Moreover, their implementation is adapted to the online setting. The main result of the paper is the upper bound on the convergence rate for the generalization error.
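As a rough illustration of the procedure described in this abstract, the sketch below runs entropic mirror descent on the probability simplex and averages the iterates with step-size weights. It is a generic textbook variant, not the paper's algorithm: the linear toy risk, the 1/sqrt(t) step-size schedule, the averaging weights, and the names `mirror_descent_averaged` and `noisy_grad` are all assumptions made for illustration.

```python
import numpy as np

def mirror_descent_averaged(grad_fn, dim, n_steps, step0=1.0, seed=0):
    """Entropic mirror descent on the simplex with step-size-weighted averaging."""
    rng = np.random.default_rng(seed)
    zeta = np.zeros(dim)               # dual variable (sum of scaled gradients)
    theta = np.full(dim, 1.0 / dim)    # primal iterate, starts at the uniform point
    weighted_sum = np.zeros(dim)
    total_weight = 0.0
    for t in range(1, n_steps + 1):
        gamma = step0 / np.sqrt(t)     # assumed step-size schedule
        weighted_sum += gamma * theta  # recursive averaging of the iterates
        total_weight += gamma
        zeta -= gamma * grad_fn(theta, rng)   # gradient step in the dual space
        shifted = zeta - zeta.max()           # shift for numerical stability
        theta = np.exp(shifted) / np.exp(shifted).sum()  # map back to the simplex
    return weighted_sum / total_weight

# Toy problem (hypothetical): minimize the linear risk costs . theta
# when only noisy gradients are observed.
costs = np.array([0.9, 0.1, 0.5])

def noisy_grad(theta, rng):
    return costs + 0.1 * rng.standard_normal(costs.shape)

theta_bar = mirror_descent_averaged(noisy_grad, dim=3, n_steps=2000)
print(theta_bar)  # averaged estimate, a point on the simplex
```

The gradient step is taken on `zeta` in the dual space and the exponential map returns the iterate to the simplex, so no explicit projection is needed.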
Matrix Exponential Gradient Updates for On-line Learning and Bregman Projection
Tsuda, Koji, Rätsch, Gunnar, Warmuth, Manfred K.
We address the problem of learning a symmetric positive definite matrix. The central issue is to design parameter updates that preserve positive definiteness. Our updates are motivated with the von Neumann divergence. Rather than treating the most general case, we focus on two key applications that exemplify our methods: Online learning with a simple square loss and finding a symmetric positive definite matrix subject to symmetric linear constraints. The updates generalize the Exponentiated Gradient (EG) update and AdaBoost, respectively: the parameter is now a symmetric positive definite matrix of trace one instead of a probability vector (which in this context is a diagonal positive definite matrix with trace one). The generalized updates use matrix logarithms and exponentials to preserve positive definiteness. Most importantly, we show how the analysis of each algorithm generalizes to the non-diagonal case. We apply both new algorithms, called the Matrix Exponentiated Gradient (MEG) update and DefiniteBoost, to learn a kernel matrix from distance measurements.
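The update described in this abstract can be sketched as follows for the square-loss case, assuming the prediction is tr(W X) and the step is taken in the matrix-logarithm domain, then renormalized to trace one. The helper name `meg_step`, the learning rate, the eigenvalue clipping, and the synthetic data are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def meg_step(W, X, y, eta):
    """One matrix exponentiated gradient step for the loss (tr(W X) - y)^2."""
    y_hat = np.trace(W @ X)
    grad = 2.0 * (y_hat - y) * X                 # gradient of the square loss in W
    vals, vecs = np.linalg.eigh(W)
    vals = np.clip(vals, np.finfo(float).tiny, None)  # guard against log(0)
    log_W = (vecs * np.log(vals)) @ vecs.T       # matrix logarithm via eigenvalues
    M = log_W - eta * grad
    vals, vecs = np.linalg.eigh((M + M.T) / 2.0) # symmetrize against rounding
    expo = np.exp(vals - vals.max())             # shift cancels after normalization
    W_new = (vecs * expo) @ vecs.T               # matrix exponential
    return W_new / np.trace(W_new)               # renormalize to trace one

# Synthetic online stream (hypothetical): targets come from a fixed matrix W_true.
rng = np.random.default_rng(0)
dim = 3
B = rng.standard_normal((dim, dim))
W_true = B @ B.T
W_true /= np.trace(W_true)

W = np.eye(dim) / dim                            # start at the maximally mixed matrix
for _ in range(100):
    A = rng.standard_normal((dim, dim))
    X = (A + A.T) / 2.0                          # symmetric instance matrix
    y = np.trace(W_true @ X)
    W = meg_step(W, X, y, eta=0.05)
```

Because the new parameter is an exponential of a symmetric matrix divided by its trace, it stays symmetric positive definite with trace one after every step, which is the property the abstract emphasizes.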
Learning Curves for Stochastic Gradient Descent in Linear Feedforward Networks
Werfel, Justin, Xie, Xiaohui, Seung, H. S.
Gradient-following learning methods can encounter problems of implementation in many applications, and stochastic variants are frequently used to overcome these difficulties. We derive quantitative learning curves for three online training methods used with a linear perceptron: direct gradient descent, node perturbation, and weight perturbation. The maximum learning rate for the stochastic methods scales inversely with the first power of the dimensionality of the noise injected into the system; with sufficiently small learning rate, all three methods give identical learning curves. These results suggest guidelines for when these stochastic methods will be limited in their utility, and considerations for architectures in which they will be effective.
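To make the three training methods concrete, here is a minimal sketch for a linear network y = W x with squared error. The perturbation-based estimators are written in a common textbook form, and the paper's exact normalization may differ. The function names, the noise scale `sigma`, and the toy input are assumptions for illustration.

```python
import numpy as np

def direct_gradient(W, x, target):
    """Exact gradient of 0.5 * |W x - target|^2 with respect to W."""
    err = W @ x - target
    return np.outer(err, x)

def node_perturbation(W, x, target, sigma, rng):
    """Gradient estimate from Gaussian noise injected at the output nodes."""
    y = W @ x
    xi = sigma * rng.standard_normal(y.shape)
    base = 0.5 * np.sum((y - target) ** 2)
    perturbed = 0.5 * np.sum((y + xi - target) ** 2)
    return (perturbed - base) / sigma**2 * np.outer(xi, x)

def weight_perturbation(W, x, target, sigma, rng):
    """Gradient estimate from Gaussian noise injected into every weight."""
    psi = sigma * rng.standard_normal(W.shape)
    base = 0.5 * np.sum((W @ x - target) ** 2)
    perturbed = 0.5 * np.sum(((W + psi) @ x - target) ** 2)
    return (perturbed - base) / sigma**2 * psi

# Toy check (hypothetical data): both stochastic estimates average to the gradient.
rng = np.random.default_rng(0)
x = np.array([1.0, -0.5, 0.25])
W = np.zeros((2, 3))
target = np.array([[1.0, 0.0, -1.0], [0.5, 2.0, 0.0]]) @ x
n_samples = 20000

grad = direct_gradient(W, x, target)
node_avg = np.mean(
    [node_perturbation(W, x, target, 0.1, rng) for _ in range(n_samples)], axis=0)
weight_avg = np.mean(
    [weight_perturbation(W, x, target, 0.1, rng) for _ in range(n_samples)], axis=0)
```

In each case a learning step would be `W -= eta * estimate`. Node perturbation injects noise in as many dimensions as there are outputs, while weight perturbation injects it in as many dimensions as there are weights, which is the quantity the abstract ties to the maximum usable learning rate.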
A Natural Policy Gradient
We provide a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space. Although gradient methods cannot make large changes in the values of the parameters, we show that the natural gradient is moving toward choosing a greedy optimal action rather than just a better action. These greedy optimal actions are those that would be chosen under one improvement step of policy iteration with approximate, compatible value functions, as defined by Sutton et al. [9]. We then show drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris. 1 Introduction There has been a growing interest in direct policy-gradient methods for approximate planning in large Markov decision problems (MDPs). Unfortunately, the standard gradient descent rule is noncovariant.
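The contrast between the standard and the natural gradient can be seen in a very small case. The sketch below assumes a single-state MDP with a tabular softmax policy and known rewards, so that all expectations are exact. The function names and the numbers are hypothetical, and the pseudo-inverse is used because the Fisher matrix of a softmax policy is singular.

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def vanilla_and_natural_gradient(theta, rewards):
    """Exact policy gradient and natural gradient for a one-state softmax policy."""
    pi = softmax(theta)
    # Standard gradient of the expected reward sum_a pi(a) r(a).
    grad = pi * (rewards - pi @ rewards)
    # Fisher information: E[grad log pi(a) grad log pi(a)^T] = diag(pi) - pi pi^T.
    fisher = np.diag(pi) - np.outer(pi, pi)
    natural = np.linalg.pinv(fisher) @ grad
    return grad, natural

theta = np.array([2.0, 0.0, -1.0])     # policy currently favors action 0
rewards = np.array([0.0, 0.5, 1.0])    # action 2 is the greedy optimal action
grad, natural = vanilla_and_natural_gradient(theta, rewards)
print(np.argmax(grad), np.argmax(natural))
```

In this toy case the natural gradient equals the centered reward vector, so its largest component points at the greedy action no matter how unlikely that action currently is. The standard gradient is scaled by the current action probabilities and here increases the preference for action 1 the most, which is merely a better action than the current favorite.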
A Natural Policy Gradient
Sham Kakade, Gatsby Computational Neuroscience Unit, 17 Queen Square, London, UK WC1N 3AR, http://www.gatsby.ucl.ac.uk, sham@gatsby.ucl.ac.uk
We provide a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space. Although gradient methods cannot make large changes in the values of the parameters, we show that the natural gradient is moving toward choosing a greedy optimal action rather than just a better action. These greedy optimal actions are those that would be chosen under one improvement step of policy iteration with approximate, compatible value functions, as defined by Sutton et al. We then show drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris. 1 Introduction. There has been a growing interest in direct policy-gradient methods for approximate planning in large Markov decision problems (MDPs). Unfortunately, the standard gradient descent rule is noncovariant. In this paper, we present a covariant gradient by defining a metric based on the underlying structure of the policy.
Algorithms for Non-negative Matrix Factorization
Lee, Daniel D., Seung, H. Sebastian
Nonnegative matrix factorization (NMF) has previously been shown to be a useful decomposition for multivariate data. Two different multiplicative algorithms for NMF are analyzed. They differ only slightly in the multiplicative factor used in the update rules. One algorithm can be shown to minimize the conventional least squares error while the other minimizes the generalized Kullback-Leibler divergence. The monotonic convergence of both algorithms can be proven using an auxiliary function analogous to that used for proving convergence of the Expectation Maximization algorithm. The algorithms can also be interpreted as diagonally rescaled gradient descent, where the rescaling factor is optimally chosen to ensure convergence.
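The least squares variant of the multiplicative updates can be sketched as below for a factorization V ≈ W H. The small constant `eps` in the denominators, the random initialization, and the function name `nmf_multiplicative` are assumptions added for this illustration. The Kullback-Leibler variant is not shown.

```python
import numpy as np

def nmf_multiplicative(V, rank, n_iter=50, eps=1e-12, seed=0):
    """Multiplicative updates for min |V - W H|^2 subject to W, H >= 0."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank)) + 0.1    # strictly positive starting factors
    H = rng.random((rank, m)) + 0.1
    errors = [np.sum((V - W @ H) ** 2)]
    for _ in range(n_iter):
        # Each factor is rescaled elementwise, so nonnegativity is preserved.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        errors.append(np.sum((V - W @ H) ** 2))
    return W, H, errors

V = np.random.default_rng(1).random((6, 5))   # hypothetical nonnegative data
W, H, errors = nmf_multiplicative(V, rank=2)
print(errors[0], errors[-1])
```

Each update equals a gradient step with an elementwise step size, for example H / (WᵀWH) for the H update, which is the diagonally rescaled gradient descent reading mentioned in the abstract.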