Gradient Descent
Dynamics of On-Line Gradient Descent Learning for Multilayer Neural Networks
We consider the problem of on-line gradient descent learning for general two-layer neural networks. An analytic solution is pre(cid:173) sented and used to investigate the role of the learning rate in con(cid:173) trolling the evolution and convergence of the learning process. Learning in layered neural networks refers to the modification of internal parameters {J} which specify the strength of the interneuron couplings, so as to bring the map fJ implemented by the network as close as possible to a desired map 1. The degree of success is monitored through the generalization error, a measure of the dissimilarity between fJ and 1. Consider maps from an N-dimensional input space e onto a scalar (, as arise in the formulation of classification and regression tasks.
One-unit Learning Rules for Independent Component Analysis
Neural one-unit learning rules for the problem of Independent Com(cid:173) ponent Analysis (ICA) and blind source separation are introduced. In these new algorithms, every ICA neuron develops into a sepa(cid:173) rator that finds one of the independent components. The learning rules use very simple constrained Hebbianjanti-Hebbian learning in which decorrelating feedback may be added. To speed up the convergence of these stochastic gradient descent rules, a novel com(cid:173) putationally efficient fixed-point algorithm is introduced.
Text-Based Information Retrieval Using Exponentiated Gradient Descent
The following investigates the use of single-neuron learning algo(cid:173) rithms to improve the performance of text-retrieval systems that accept natural-language queries. A retrieval process is explained that transforms the natural-language query into the query syntax of a real retrieval system: the initial query is expanded using statis(cid:173) tical and learning techniques and is then used for document ranking and binary classification. The results of experiments suggest that Kivinen and Warmuth's Exponentiated Gradient Descent learning algorithm works significantly better than previous approaches.
The Efficiency and the Robustness of Natural Gradient Descent Learning Rule
The inverse of the Fisher information matrix is used in the natu(cid:173) ral gradient descent algorithm to train single-layer and multi-layer perceptrons. We have discovered a new scheme to represent the Fisher information matrix of a stochastic multi-layer perceptron. Based on this scheme, we have designed an algorithm to compute the natural gradient. When the input dimension n is much larger than the number of hidden neurons, the complexity of this algo(cid:173) rithm is of order O(n). It is confirmed by simulations that the natural gradient descent learning rule is not only efficient but also robust.
Gradient Descent for General Reinforcement Learning
A simple learning rule is derived, the VAPS algorithm, which can be instantiated to generate a wide range of new reinforcement(cid:173) learning algorithms. These algorithms solve a number of open problems, define several new approaches to reinforcement learning, and unify different approaches to reinforcement learning under a single theory. These algorithms all have guaranteed convergence, and include modifications of several existing algorithms that were known to fail to converge on simple MOPs. These include Q(cid:173) In addition to these learning, SARSA, and advantage learning. Simulations results are given, and several areas for future research are discussed.
Boosting Algorithms as Gradient Descent
We provide an abstract characterization of boosting algorithms as gradient decsent on cost-functionals in an inner-product function space. We prove convergence of these functional-gradient-descent algorithms under quite weak conditions. Following previous theo(cid:173) retical results bounding the generalization performance of convex combinations of classifiers in terms of general cost functions of the margin, we present a new algorithm (DOOM II) for performing a gradient descent optimization of such cost functions. Experiments on several data sets from the UC Irvine repository demonstrate that DOOM II generally outperforms AdaBoost, especially in high noise situations, and that the overfitting behaviour of AdaBoost is predicted by our cost functions.
Matrix Exponential Gradient Updates for On-line Learning and Bregman Projection
We address the problem of learning a symmetric positive definite matrix. The central issue is to design parameter updates that preserve positive definiteness. Our updates are motivated with the von Neumann diver- gence. Rather than treating the most general case, we focus on two key applications that exemplify our methods: On-line learning with a simple square loss and finding a symmetric positive definite matrix subject to symmetric linear constraints. The updates generalize the Exponentiated Gradient (EG) update and AdaBoost, respectively: the parameter is now a symmetric positive definite matrix of trace one instead of a probability vector (which in this context is a diagonal positive definite matrix with trace one).
Fast Variational Inference for Large-scale Internet Diagnosis
Web servers on the Internet need to maintain high reliability, but the cause of intermittent failures of web transactions is non-obvious. We use Bayesian inference to diagnose problems with web services. This diagnosis problem is far larger than any previously attempted: it requires inference of 10 4 possible faults from 10 5 observations. Further, such inference must be performed in less than a second. Inference can be done at this speed by combining a variational approximation, a mean-field approximation, and the use of stochastic gradient descent to optimize a variational cost function.
Adaptive Online Gradient Descent
We study the rates of growth of the regret in online convex optimization. First, we show that a simple extension of the algorithm of Hazan et al eliminates the need for a priori knowledge of the lower bound on the second derivatives of the observed functions. We then provide an algorithm, Adaptive Online Gradient Descent, which interpolates between the results of Zinkevich for linear functions and of Hazan et al for strongly convex functions, achieving intermediate rates T and log T . Furthermore, we show strong optimality of the algorithm.
Topmoumoute Online Natural Gradient Algorithm
Guided by the goal of obtaining an optimization algorithm that is both fast and yielding good generalization, we study the descent direction maximizing the decrease in generalization error or the probability of not increasing generalization error. The surprising result is that from both the Bayesian and frequentist perspectives this can yield the natural gradient direction. Although that direction can be very expensive to compute we develop an efficient, general, online approximation to the natural gradient descent which is suited to large scale problems. We report experimental results showing much faster convergence in computation time and in number of iterations with TONGA (Topmoumoute Online natural Gradient Algorithm) than with stochastic gradient descent, even on very large datasets.