 Statistical Learning


Factorial Learning by Clustering Features

Neural Information Processing Systems

We introduce a novel algorithm for factorial learning, motivated by segmentation problems in computational vision, in which the underlying factors correspond to clusters of highly correlated input features. The algorithm derives from a new kind of competitive clustering model, in which the cluster generators compete to explain each feature of the data set and cooperate to explain each input example, rather than competing for examples and cooperating on features, as in traditional clustering algorithms. A natural extension of the algorithm recovers hierarchical models of data generated from multiple unknown categories, each with a different, multiple-causal structure. Several simulations demonstrate the power of this approach.


Learning Local Error Bars for Nonlinear Regression

Neural Information Processing Systems

We present a new method for obtaining local error bars for nonlinear regression, i.e., estimates of the confidence in predicted values that depend on the input. We approach this problem by applying a maximum-likelihood framework to an assumed distribution of errors. We demonstrate our method first on computer-generated data with locally varying, normally distributed target noise. We then apply it to laser data from the Santa Fe Time Series Competition where the underlying system noise is known quantization error and the error bars give local estimates of model misspecification. In both cases, the method also provides a weighted-regression effect that improves generalization performance.
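
As a rough illustration of the maximum-likelihood idea in the abstract above, the following sketch assumes Gaussian target noise whose variance depends on the input. The function names (`gaussian_nll`, `grad_wrt_mean`) and the numbers are hypothetical and not taken from the paper. The sketch only shows that the gradient with respect to the predicted mean is scaled by 1/variance, which is one plausible reading of the weighted-regression effect.

```python
import math

def gaussian_nll(y, mean, variance):
    """Negative log-likelihood of target y under N(mean, variance)."""
    return 0.5 * (math.log(2.0 * math.pi * variance) + (y - mean) ** 2 / variance)

def grad_wrt_mean(y, mean, variance):
    """Derivative of gaussian_nll with respect to the predicted mean."""
    return (mean - y) / variance

# Same residual, two assumed local noise levels (illustrative values only).
low_noise = grad_wrt_mean(y=1.0, mean=0.0, variance=0.25)
high_noise = grad_wrt_mean(y=1.0, mean=0.0, variance=4.0)
print(low_noise, high_noise)
```

Under these assumptions, a data point in a low-noise region pulls on the regression estimate more strongly than an equally distant point in a high-noise region.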


Multidimensional Scaling and Data Clustering

Neural Information Processing Systems

Visualizing and structuring pairwise dissimilarity data are difficult combinatorial optimization problems known as multidimensional scaling or pairwise data clustering. Algorithms for embedding dissimilarity data sets in a Euclidean space, for clustering these data and for actively selecting data to support the clustering process are discussed in the maximum entropy framework. Active data selection provides a strategy to discover structure in a data set efficiently with partially unknown data.

1 Introduction

Grouping experimental data into compact clusters arises as a data analysis problem in psychology, linguistics, genetics and other experimental sciences. The data which are supposed to be clustered are either given by an explicit coordinate representation (central clustering) or, in the non-metric case, they are characterized by dissimilarity values for pairs of data points (pairwise clustering). In this paper we study algorithms (i) for embedding non-metric data in a D-dimensional Euclidean space, (ii) for simultaneous clustering and embedding of non-metric data, and (iii) for active data selection to determine a particular cluster structure with a minimal number of data queries. All algorithms are derived from the maximum entropy principle (Hertz et al., 1991) which guarantees robust statistics (Tikochinsky et al., 1984).


Bayesian Query Construction for Neural Network Models

Neural Information Processing Systems

If data collection is costly, there is much to be gained by actively selecting particularly informative data points in a sequential way. In a Bayesian decision-theoretic framework we develop a query selection criterion which explicitly takes into account the intended use of the model predictions. By Markov Chain Monte Carlo methods the necessary quantities can be approximated to a desired precision. As the number of data points grows, the model complexity is modified by a Bayesian model selection strategy. The properties of two versions of the criterion are demonstrated in numerical experiments.


An Input Output HMM Architecture

Neural Information Processing Systems

We introduce a recurrent architecture having a modular structure and we formulate a training procedure based on the EM algorithm. The resulting model has similarities to hidden Markov models, but supports a recurrent-network processing style and allows the supervised learning paradigm to be exploited while using maximum likelihood estimation.

1 INTRODUCTION

Learning problems involving sequentially structured data cannot be dealt with effectively by static models such as feedforward networks. Recurrent networks can model complex dynamical systems and can store and retrieve contextual information in a flexible way. Until now, research on supervised learning for recurrent networks has focused almost exclusively on error minimization by gradient descent methods. Although these methods are effective for learning short-term memories, practical difficulties have been reported in training recurrent neural networks to perform tasks in which the temporal contingencies present in the input/output sequences span long intervals (Bengio et al., 1994; Mozer, 1992).


Combining Estimators Using Non-Constant Weighting Functions

Neural Information Processing Systems

Volker Tresp* and Michiaki Taniguchi, Siemens AG, Central Research, Otto-Hahn-Ring 6, 81730 München, Germany

Abstract

This paper discusses the linearly weighted combination of estimators in which the weighting functions are dependent on the input. We show that the weighting functions can be derived either by evaluating the input-dependent variance of each estimator or by estimating how likely it is that a given estimator has seen data in the region of the input space close to the input pattern. The latter solution is closely related to the mixture of experts approach and we show how learning rules for the mixture of experts can be derived from the theory about learning with missing features. The presented approaches are modular since the weighting functions can easily be modified (no retraining) if more estimators are added. Furthermore, it is easy to incorporate estimators which were not derived from data such as expert systems or algorithms.

1 Introduction

Instead of modeling the global dependency between input x ∈ ℝ^D and output y ∈ ℝ using a single estimator, it is often very useful to decompose a complex mapping

*At the time of the research for this paper, a visiting researcher at the Center for Biological and Computational Learning, MIT.
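
To make the variance-based weighting mentioned in the abstract above concrete, here is a minimal sketch. It assumes, as an illustration only, that each estimator's weight is the normalized inverse of its variance at the given input; the abstract does not spell out the exact form. The function name `combine` and the example numbers are hypothetical.

```python
def combine(predictions, variances):
    """Combine estimator outputs at one input using inverse-variance weights.

    Assumption: weight_i is proportional to 1 / variance_i at this input.
    """
    inverse = [1.0 / v for v in variances]
    total = sum(inverse)
    weights = [w / total for w in inverse]
    combined = sum(w * p for w, p in zip(weights, predictions))
    return combined, weights

# Two estimators at the same input; the first is assumed more certain here.
combined, weights = combine(predictions=[1.0, 3.0], variances=[1.0, 3.0])
print(combined, weights)
```

Because the weights are recomputed from per-estimator quantities, adding a third estimator in this sketch only means passing a longer list, which is in line with the modularity the abstract describes.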


Statistical Feature Combination for the Evaluation of Game Positions

Journal of Artificial Intelligence Research

This article describes an application of three well-known statistical methods in the field of game-tree search: using a large number of classified Othello positions, feature weights for evaluation functions with a game-phase-independent meaning are estimated by means of logistic regression, Fisher's linear discriminant, and the quadratic discriminant function for normally distributed features. Thereafter, the playing strengths are compared by means of tournaments between the resulting versions of a world-class Othello program. In this application, logistic regression - which is used here for the first time in the context of game playing - leads to better results than the other approaches.


A Learning Analog Neural Network Chip with Continuous-Time Recurrent Dynamics

Neural Information Processing Systems

The recurrent network, containing six continuous-time analog neurons and 42 free parameters (connection strengths and thresholds), is trained to generate time-varying outputs approximating given periodic signals presented to the network. The chip implements a stochastic perturbative algorithm, which observes the error gradient along random directions in the parameter space for error-descent learning. In addition to the integrated learning functions and the generation of pseudo-random perturbations, the chip provides for teacher forcing and long-term storage of the volatile parameters. The network learns a 1 kHz circular trajectory in 100 sec. The chip occupies 2 mm × 2 mm in a 2 µm CMOS process, and dissipates 1.2 mW.

1 Introduction

Exact gradient-descent algorithms for supervised learning in dynamic recurrent networks [1-3] are fairly complex and do not provide for a scalable implementation in a standard 2-D VLSI process. We have implemented a fairly simple and scalable

*Present address: Johns Hopkins University, ECE Dept., Baltimore MD 21218-2686.
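
The idea of observing the error gradient along random directions, as described in the abstract above, can be sketched in software as follows. This is a toy illustration, not the chip's circuit: the quadratic `error` function, the two-sided probe, and the values of `sigma` and `rate` are all assumptions chosen for the example.

```python
import random

def error(params, target):
    """Toy squared error standing in for the network's output error."""
    return sum((p - t) ** 2 for p, t in zip(params, target))

def perturbative_step(params, target, rng, sigma=0.1, rate=10.0):
    """One update: probe the error along a random +/-sigma direction."""
    pert = [sigma * rng.choice((-1.0, 1.0)) for _ in params]
    plus = error([p + d for p, d in zip(params, pert)], target)
    minus = error([p - d for p, d in zip(params, pert)], target)
    slope = 0.5 * (plus - minus)  # error change observed along the perturbation
    # Move every parameter against the observed change, scaled by its own perturbation.
    return [p - rate * slope * d for p, d in zip(params, pert)]

rng = random.Random(0)
target = [1.0, -2.0, 0.5]
params = [0.0, 0.0, 0.0]
initial = error(params, target)
for _ in range(500):
    params = perturbative_step(params, target, rng)
final = error(params, target)
print(initial, final)
```

Only two scalar error measurements are needed per update, regardless of the number of parameters, which suggests why such a scheme may be easier to scale in hardware than an exact gradient computation.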


Unsupervised Learning of Mixtures of Multiple Causes in Binary Data

Neural Information Processing Systems

This paper presents a formulation for unsupervised learning of clusters reflecting multiple causal structure in binary data. Unlike the standard mixture model, a multiple cause model accounts for observed data by combining assertions from many hidden causes, each of which can pertain to varying degree to any subset of the observable dimensions. A crucial issue is the mixing-function for combining beliefs from different cluster-centers in order to generate data reconstructions whose errors are minimized both during recognition and learning. We demonstrate a weakness inherent to the popular weighted sum followed by sigmoid squashing, and offer an alternative form of the nonlinearity. Results are presented demonstrating the algorithm's ability successfully to discover coherent multiple causal representations of noisy test data and in images of printed characters.

1 Introduction

The objective of unsupervised learning is to identify patterns or features reflecting underlying regularities in data. Single-cause techniques, including the k-means algorithm and the standard mixture-model (Duda and Hart, 1973), represent clusters of data points sharing similar patterns of 1s and 0s under the assumption that each data point belongs to, or was generated by, one and only one cluster-center; output activity is constrained to sum to 1. In contrast, a multiple-cause model permits more than one cluster-center to become fully active in accounting for an observed data vector.


Fool's Gold: Extracting Finite State Machines from Recurrent Network Dynamics

Neural Information Processing Systems

Several recurrent networks have been proposed as representations for the task of formal language learning. After training a recurrent network to recognize a formal language or predict the next symbol of a sequence, the next logical step is to understand the information processing carried out by the network. Some researchers have begun extracting finite state machines from the internal state trajectories of their recurrent networks. This paper describes how sensitivity to initial conditions and discrete measurements can trick these extraction methods into returning illusory finite state descriptions.