Goto

Collaborating Authors

 Statistical Learning


A Unified Gradient-Descent/Clustering Architecture for Finite State Machine Induction

Neural Information Processing Systems

Researchers often try to understand-post hoc-representations that emerge in the hidden layers of a neural net following training. Interpretation is difficult because these representations are typically highly distributed and continuous. By "continuous," wemean that if one constructed a scatterplot over the hidden unit activity space of patterns obtained in response to various inputs, examination at any scale would reveal the patterns to be broadly distributed over the space.


Operations for Learning with Graphical Models

Journal of Artificial Intelligence Research

This paper is a multidisciplinary review of empirical, statistical learning from a graphical model perspective. Well-known examples of graphical models include Bayesian networks, directed graphs representing a Markov chain, and undirected networks representing a Markov field. These graphical models are extended to model data analysis and empirical learning using the notation of plates. Graphical operations for simplifying and manipulating a problem are provided including decomposition, differentiation, andthe manipulation of probability models from the exponential family. Two standard algorithm schemas for learning are reviewed in a graphical framework: Gibbs sampling and the expectation maximizationalgorithm. Using these operations and schemas, some popular algorithms can be synthesized from their graphical specification. This includes versions of linear regression, techniques for feed-forward networks, and learning Gaussian and discrete Bayesian networks from data. The paper concludes by sketching some implications for data analysis and summarizing how some popular algorithms fall within the framework presented. The main original contributions here are the decompositiontechniques and the demonstration that graphical models provide a framework for understanding and developing complex learning algorithms.


Random Worlds and Maximum Entropy

Journal of Artificial Intelligence Research

Given a knowledge base KB containing first-order and statistical facts, we consider a principled method, called the random-worlds method, for computing a degree of belief that some formula Phi holds given KB. If we are reasoning about a world or system consisting of N individuals, then we can consider all possible worlds, or first-order models, withdomain {1,...,N} that satisfy KB, and compute thefraction of them in which Phi is true. We define the degree of belief to be the asymptotic value of this fraction as N grows large. We show that when the vocabulary underlying Phi andKB uses constants and unary predicates only, we can naturally associate an entropy with each world. As N grows larger,there are many more worlds with higher entropy. Therefore, we can usea maximum-entropy computation to compute the degree of belief. This result is in a similar spirit to previous work in physics and artificial intelligence, but is far more general. Of equal interest to the result itself are the limitations on its scope. Most importantly, the restriction to unary predicates seems necessary. Although the random-worlds method makes sense in general, the connection to maximum entropy seems to disappear in the non-unary case. These observations suggest unexpected limitations to the applicability of maximum-entropy methods.


Machine Learning, Neural and Statistical Classification

Classics

This book (originally published in 1994 by Ellis Horwood) is now out of print. The copyright now resides with the editors who have decided to make the material freely available on the web.This book is based on the EC (ESPRIT) project StatLog which compare and evaluated a range of classification techniques, with an assessment of their merits, disadvantages and range of application. This integrated volume provides a concise introduction to each method, and reviews comparative trials in large-scale commercial and industrial problems. It makes accessible to a wide range of workers the complex issue of classification as approached through machine learning, statistics and neural networks, encouraging a cross-fertilization between these discplines.


A Parallel Gradient Descent Method for Learning in Analog VLSI Neural Networks

Neural Information Processing Systems

Typical methods for gradient descent in neural network learning involve calculation of derivatives based on a detailed knowledge of the network model. This requires extensive, time consuming calculations for each pattern presentation and high precision that makes it difficult to implement in VLSI. We present here a perturbation technique that measures, not calculates, the gradient. Since the technique uses the actual network as a measuring device, errors in modeling neuron activation and synaptic weights do not cause errors in gradient descent. The method is parallel in nature and easy to implement in VLSI. We describe the theory of such an algorithm, an analysis of its domain of applicability, some simulations using it and an outline of a hardware implementation.


Automatic Learning Rate Maximization by On-Line Estimation of the Hessian's Eigenvectors

Neural Information Processing Systems

We propose a very simple, and well principled way of computing the optimal step size in gradient descent algorithms. The online version is very efficient computationally, and is applicable to large backpropagation networks trained on large data sets. The main ingredient is a technique for estimating the principal eigenvalue(s) and eigenvector(s) of the objective function's second derivative matrix (Hessian), which does not require to even calculate the Hessian. Several other applications of this technique are proposed for speeding up learning, or for eliminating useless parameters. 1 INTRODUCTION Choosing the appropriate learning rate, or step size, in a gradient descent procedure such as backpropagation, is simultaneously one of the most crucial and expertintensive part of neural-network learning. We propose a method for computing the best step size which is both well-principled, simple, very cheap computationally, and, most of all, applicable to online training with large networks and data sets.


Non-Linear Dimensionality Reduction

Neural Information Processing Systems

A method for creating a nonlinear encoder-decoder for multidimensional data with compact representations is presented. The commonly used technique of autoassociation is extended to allow nonlinear representations, and an objective function which penalizes activations of individual hidden units is shown to result in minimum dimensional encodings with respect to allowable error in reconstruction. 1 INTRODUCTION Reducing dimensionality of data with minimal information loss is important for feature extraction, compact coding and computational efficiency. The data can be tranformed into "good" representations for further processing, constraints among feature variables may be identified, and redundancy eliminated. Many algorithms are exponential in the dimensionality of the input, thus even reduction by a single dimension may provide valuable computational savings. Autoassociating feed forward networks with one hidden layer have been shown to extract the principal components of the data (Baldi & Hornik, 1988). Such networks have been used to extract features and develop compact encodings of the data (Cottrell, Munro & Zipser, 1989). Principal Components Analysis projects the data into a linear subspace -email: demers@cs.ucsd.edu


Weight Space Probability Densities in Stochastic Learning: II. Transients and Basin Hopping Times

Neural Information Processing Systems

In stochastic learning, weights are random variables whose time evolution is governed by a Markov process. We summarize the theory of the time evolution of P, and give graphical examples of the time evolution that contrast the behavior of stochastic learning with true gradient descent (batch learning). Finally, we use the formalism to obtain predictions of the time required for noise-induced hopping between basins of different optima. We compare the theoretical predictions with simulations of large ensembles of networks for simple problems in supervised and unsupervised learning. Despite the recent application of convergence theorems from stochastic approximation theory to neural network learning (Oja 1982, White 1989) there remain outstanding questions about the search dynamics in stochastic learning.


Efficient Pattern Recognition Using a New Transformation Distance

Neural Information Processing Systems

Memory-based classification algorithms such as radial basis functions or K-nearest neighbors typically rely on simple distances (Euclidean, dot product...), which are not particularly meaningful on pattern vectors. More complex, better suited distance measures are often expensive and rather ad-hoc (elastic matching, deformable templates). We propose a new distance measure which (a) can be made locally invariant to any set of transformations of the input and (b) can be computed efficiently. We tested the method on large handwritten character databases provided by the Post Office and the NIST. Using invariances with respect to translation, rotation, scaling, shearing and line thickness, the method consistently outperformed all other systems tested on the same databases.


Analog VLSI Implementation of Multi-dimensional Gradient Descent

Neural Information Processing Systems

The implementation uses noise injection and multiplicative correlation to estimate derivatives, as in [Anderson, Kerns 92]. One intended application of this technique is setting circuit parameters on-chip automatically, rather than manually [Kirk 91]. Gradient descent optimization may be used to adjust synapse weights for a backpropagation or other on-chip learning implementation. The approach combines the features of continuous multidimensional gradient descent and the potential for an annealing style of optimization. We present data measured from our analog VLSI implementation. 1 Introduction This work is similar to [Anderson, Kerns 92], but represents two advances. First, we describe the extension of the technique to multiple dimensions. Second, we demonstrate an implementation of the multidimensional technique in analog VLSI, and provide results measured from the chip. Unlike previous work using noise sources in adaptive systems, we use the noise as a means of estimating the gradient of a function f(y), rather than performing an annealing process [Alspector 88]. We also estimate gr-;:dients continuously in position and time, in contrast to [Umminger 89] and [J abri 91], which utilize discrete position gradient estimates.