Goto

Collaborating Authors

 Statistical Learning


The Efficiency and the Robustness of Natural Gradient Descent Learning Rule

Neural Information Processing Systems

The inverse of the Fisher information matrix is used in the natural gradientdescent algorithm to train single-layer and multi-layer perceptrons. We have discovered a new scheme to represent the Fisher information matrix of a stochastic multi-layer perceptron. Based on this scheme, we have designed an algorithm to compute the natural gradient. When the input dimension n is much larger than the number of hidden neurons, the complexity of this algorithm isof order O(n). It is confirmed by simulations that the natural gradient descent learning rule is not only efficient but also robust. 1 INTRODUCTION The inverse of the Fisher information matrix is required to find the Cramer-Rae lower bound to analyze the performance of an unbiased estimator. It is also needed in the natural gradient learning framework (Amari, 1997) to design statistically efficient algorithms for estimating parameters in general and for training neural networks in particular. In this paper, we assume a stochastic model for multilayer perceptrons.Considering a Riemannian parameter space in which the Fisher information matrix is a metric tensor, we apply the natural gradient learning rule to train single-layer and multi-layer perceptrons. The main difficulty encountered is to compute the inverse of the Fisher information matrix of large dimensions when the input dimension is high. By exploring the structure of the Fisher information matrix and its inverse, we design a fast algorithm with lower complexity to implement the natural gradient learning algorithm.


Competitive On-line Linear Regression

Neural Information Processing Systems

We apply a general algorithm for merging prediction strategies (the Aggregating Algorithm) to the problem of linear regression with the square loss; our main assumption is that the response variable is bounded. It turns out that for this particular problem the Aggregating Algorithmresembles, but is slightly different from, the wellknown ridgeestimation procedure. From general results about the Aggregating Algorithm we deduce a guaranteed bound on the difference betweenour algorithm's performance and the best, in some sense, linear regression function's performance. We show that the AA attains the optimal constant in our bound, whereas the constant attainedby the ridge regression procedure in general can be 4 times worse. 1 INTRODUCTION The usual approach to regression problems is to assume that the data are generated bysome stochastic mechanism and make some, typically very restrictive, assumptions about that stochastic mechanism. In recent years, however, a different approach to this kind of problems was developed (see, e.g., DeSantis et al. [2], Littlestone andWarmuth [7]): in our context, that approach sets the goal of finding an online algorithm that performs not much worse than the best regression function foundoff-line; in other words, it replaces the usual statistical analyses by the competitive analysis of online algorithms. DeSantis et al. [2] performed a competitive analysis of the Bayesian merging scheme for the log-loss prediction game; later Littlestone and Warmuth [7] and Vovk [10] introduced an online algorithm (called the Weighted Majority Algorithm by the Competitive Online Linear Regression 365 former authors) for the simple binary prediction game. These two algorithms (the Bayesian merging scheme and the Weighted Majority Algorithm) are special cases of the Aggregating Algorithm (AA) proposed in [9, 11]. The AA is a member of a wide family of algorithms called "multiplicative weight" or "exponential weight" algorithms. Closerto the topic of this paper, Cesa-Bianchi et al. [1) performed a competitive analysis, under the square loss, of the standard Gradient Descent Algorithm and Kivinen and Warmuth [6] complemented it by a competitive analysis of a modification ofthe Gradient Descent, which they call the Exponentiated Gradient Algorithm.


From Regularization Operators to Support Vector Kernels

Neural Information Processing Systems

Support Vector (SV) Machines for pattern recognition, regression estimation and operator inversion exploit the idea of transforming into a high dimensional feature space where they perform a linear algorithm. Instead of evaluating this map explicitly, one uses Hilbert Schmidt Kernels k(x, y) which correspond to dot products of the mapped data in high dimensional space, i.e. k(x,y) ( I (x) ยท I (y)) (I) with I: .!Rn --*:F denoting the map into feature space. Mostly, this map and many of its properties are unknown. Even worse, so far no general rule was available.


Data-Dependent Structural Risk Minimization for Perceptron Decision Trees

Neural Information Processing Systems

Using displays of line orientations taken from Wolfe's experiments [1992], we study the hypothesis that the distinction between parallel versus serial processes arises from the availability of global information in the internal representations of the visual scene. The model operates in two phases. First, the visual displays are compressed via principal-component-analysis. Second, the compressed data is processed by a target detector module inorder to identify the existence of a target in the display. Our main finding is that targets in displays which were found experimentally tobe processed in parallel can be detected by the system, while targets in experimentally-serial displays cannot. This fundamental difference is explained via variance analysis of the compressed representations, providing a numerical criterion distinguishing parallelfrom serial displays. Our model yields a mapping of response-time slopes that is similar to Duncan and Humphreys's "search surface" [1989], providing an explicit formulation of their intuitive notion of feature similarity. It presents a neural realization ofthe processing that may underlie the classical metaphorical explanations of visual search.


Structural Risk Minimization for Nonparametric Time Series Prediction

Neural Information Processing Systems

The problem of time series prediction is studied within the uniform convergence frameworkof Vapnik and Chervonenkis. The dependence inherent in the temporal structure is incorporated into the analysis, thereby generalizing the available theory for memoryless processes. Finite sample boundsare calculated in terms of covering numbers of the approximating class,and the tradeoff between approximation and estimation is discussed. A complexity regularization approach is outlined, based on Vapnik's method of Structural Risk Minimization, and shown to be applicable inthe context of mixing stochastic processes.


Relative Loss Bounds for Multidimensional Regression Problems

Neural Information Processing Systems

We study online generalized linear regression with multidimensional outputs, i.e., neural networks with multiple output nodes but no hidden nodes. We allow at the final layer transfer functions such as the softmax functionthat need to consider the linear activations to all the output neurons. We use distance functions of a certain kind in two completely independent roles in deriving and analyzing online learning algorithms for such tasks. We use one distance function to define a matching loss function for the (possibly multidimensional) transfer function, which allows usto generalize earlier results from one-dimensional to multidimensional outputs.We use another distance function as a tool for measuring progress made by the online updates. This shows how previously studied algorithmssuch as gradient descent and exponentiated gradient fit into a common framework. We evaluate the performance of the algorithms usingrelative loss bounds that compare the loss of the online algoritm to the best off-line predictor from the relevant model class, thus completely eliminating probabilistic assumptions about the data.


Generalization in Decision Trees and DNF: Does Size Matter?

Neural Information Processing Systems

Recent theoretical results for pattern classification with thresholded real-valuedfunctions (such as support vector machines, sigmoid networks,and boosting) give bounds on misclassification probability that do not depend on the size of the classifier, and hence can be considerably smaller than the bounds that follow from the VC theory. In this paper, we show that these techniques can be more widely applied, by representing other boolean functions as two-layer neural networks (thresholded convex combinations of boolean functions).



Probabilistic Inference from Arbitrary Uncertainty using Mixtures of Factorized Generalized Gaussians

Journal of Artificial Intelligence Research

This paper presents a general and efficient framework for probabilistic inference and learning from arbitrary uncertain information. It exploits the calculation properties of finite mixture models, conjugate families and factorization. Both the joint probability density of the variables and the likelihood function of the (objective or subjective) observation are approximated by a special mixture model, in such a way that any desired conditional distribution can be directly obtained without numerical integration. We have developed an extended version of the expectation maximization (EM) algorithm to estimate the parameters of mixture models from uncertain training examples (indirect observations). As a consequence, any piece of exact or uncertain information about both input and output values is consistently handled in the inference and learning stages. This ability, extremely useful in certain situations, is not found in most alternative methods. The proposed framework is formally justified from standard probabilistic principles and illustrative examples are provided in the fields of nonparametric pattern classification, nonlinear regression and pattern completion. Finally, experiments on a real application and comparative results over standard databases provide empirical evidence of the utility of the method in a wide range of applications.


Learning Appearance Based Models: Mixtures of Second Moment Experts

Neural Information Processing Systems

This paper describes a new technique for object recognition based on learning appearance models. The image is decomposed into local regions which are described by a new texture representation called "Generalized Second Moments" that are derived from the output of multiscale, multiorientation filter banks. Class-characteristic local texture features and their global composition is learned by a hierarchical mixture of experts architecture (Jordan & Jacobs). The technique is applied to a vehicle database consisting of 5 general car categories (Sedan, Van with backdoors, Van without backdoors, old Sedan, and Volkswagen Bug). This is a difficult problem with considerable in-class variation.