 Statistical Learning


Deterministic Annealing Variant of the EM Algorithm

Neural Information Processing Systems

We present a deterministic annealing variant of the EM algorithm for maximum likelihood parameter estimation problems. In our approach, the EM process is reformulated as the problem of minimizing the thermodynamic free energy by using the principle of maximum entropy and a statistical mechanics analogy. Unlike simulated annealing approaches, this minimization is performed deterministically. Moreover, the derived algorithm, unlike the conventional EM algorithm, can obtain better estimates free of the initial parameter values.
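
As a rough illustration of the idea, deterministic annealing EM is commonly formulated with a tempered E-step, in which each posterior is raised to a power beta (an inverse temperature) that is increased toward 1. The sketch below assumes a one-dimensional Gaussian mixture with a fixed shared variance; the function names, the annealing schedule, and the iteration count are illustrative assumptions, not details taken from the paper.

```python
import math


def log_gaussian(x, mean, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)


def daem_fit(data, means, betas=(0.25, 0.5, 0.75, 1.0), var=1.0, iters=25):
    """Estimate mixture means and weights with a tempered E-step.

    betas is an assumed annealing schedule; beta = 1 is ordinary EM.
    """
    means = list(means)
    k = len(means)
    weights = [1.0 / k] * k
    for beta in betas:
        for _ in range(iters):
            # Tempered E-step: posterior proportional to (weight * density) ** beta,
            # computed in the log domain for numerical stability.
            resp = []
            for x in data:
                logs = [beta * (math.log(weights[j]) + log_gaussian(x, means[j], var))
                        for j in range(k)]
                top = max(logs)
                scores = [math.exp(v - top) for v in logs]
                total = sum(scores)
                resp.append([s / total for s in scores])
            # M-step: the usual weighted updates, using the tempered posteriors.
            for j in range(k):
                nj = sum(r[j] for r in resp)
                weights[j] = nj / len(data)
                means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
    return means, weights
```

Calling `daem_fit(data, [3.0, 4.0])` on data with clusters near -2 and 2 starts both means on the same side of the data, which is the kind of poor initialization the annealing is meant to tolerate.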


Comparing the prediction accuracy of artificial neural networks and other statistical models for breast cancer survival

Neural Information Processing Systems

The TNM staging system has been used since the early 1960s to predict breast cancer patient outcome. In an attempt to increase prognostic accuracy, many putative prognostic factors have been identified. Because the TNM stage model cannot accommodate these new factors, the proliferation of factors in breast cancer has led to clinical confusion. What is required is a new computerized prognostic system that can test putative prognostic factors and integrate the predictive factors with the TNM variables in order to increase prognostic accuracy. Using the area under the curve of the receiver operating characteristic, we compare the accuracy of the following predictive models in terms of five-year breast cancer-specific survival: pTNM staging system, principal component analysis, classification and regression trees, logistic regression, cascade correlation neural network, conjugate gradient descent neural network, probabilistic neural network, and backpropagation neural network. Several statistical models are significantly more ac-
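
The comparison metric named above, the area under the ROC curve, can be computed as the fraction of (positive, negative) pairs that a model ranks correctly. The following is a minimal sketch of that standard pairwise definition; the function name and the 0/1 label encoding are assumptions made for illustration, and nothing here reproduces the paper's data or results.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise ranking.

    scores: predicted risk or probability per case.
    labels: 1 for the positive outcome, 0 for the negative outcome (assumed encoding).
    Tied scores count as half a correct ranking.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of the two outcome groups.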


Hierarchical Mixtures of Experts Methodology Applied to Continuous Speech Recognition

Neural Information Processing Systems

In this paper, we incorporate the Hierarchical Mixtures of Experts (HME) method of probability estimation, developed by Jordan [1], into an HMM-based continuous speech recognition system. The resulting system can be thought of as a continuous-density HMM system, but instead of using Gaussian mixtures, the HME system employs a large set of hierarchically organized but relatively small neural networks to perform the probability density estimation. The hierarchical structure is reminiscent of a decision tree except for two important differences: each "expert" or neural net performs a "soft" decision rather than a hard decision, and, unlike ordinary decision trees, the parameters of all the neural nets in the HME are automatically trainable using the EM algorithm. We report results on the ARPA 5,000-word and 40,000-word Wall Street Journal corpus using HME models.

1 Introduction

Recent research has shown that a continuous-density HMM (CD-HMM) system can outperform a more constrained tied-mixture HMM system for large-vocabulary continuous speech recognition (CSR) when a large amount of training data is available [2]. In other work, the utility of decision trees has been demonstrated in classification problems by using the "divide and conquer" paradigm effectively, where a problem is divided into a hierarchical set of simpler problems. We present here a new CD-HMM system which has similar properties and possesses the same advantages as decision trees, but has the additional important advantage of having automatically trainable "soft" decision boundaries.

2 Hierarchical Mixtures of Experts

The method of Hierarchical Mixtures of Experts (HME) developed recently by Jordan [1] breaks a large-scale task into many small ones by partitioning the input space into a nested set of regions, then building a simple but specific model (local expert) in each region.
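
The "soft" decision described here can be sketched as a tree whose internal nodes blend their children's outputs with softmax gate probabilities instead of choosing a single branch. The sketch below is only an illustration: the dictionary layout, the linear gates, and the scalar-valued experts are hypothetical stand-ins, not the networks used in the paper.

```python
import math


def softmax(values):
    """Numerically stable softmax."""
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dot(weights, x):
    return sum(w * xi for w, xi in zip(weights, x))


def hme_predict(node, x):
    """Recursively blend expert outputs through soft gates.

    A leaf is {"expert": callable}; an internal node is
    {"gate_weights": [one weight vector per child], "children": [nodes]}.
    This layout is an assumption made for the example.
    """
    if "expert" in node:
        return node["expert"](x)
    gate = softmax([dot(w, x) for w in node["gate_weights"]])
    return sum(g * hme_predict(child, x)
               for g, child in zip(gate, node["children"]))
```

Because every gate output is a smooth function of its weights, the whole tree stays differentiable, which is what allows all parameters to be trained jointly rather than grown greedily as in an ordinary decision tree.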


A Rapid Graph-based Method for Arbitrary Transformation-Invariant Pattern Classification

Neural Information Processing Systems

We present a graph-based method for rapid, accurate search through prototypes for transformation-invariant pattern classification. Our method has in theory the same recognition accuracy as other recent methods based on "tangent distance" [Simard et al., 1994], since it uses the same categorization rule. Nevertheless ours is significantly faster during classification because far fewer tangent distances need be computed. Crucial to the success of our system are 1) a novel graph architecture in which transformation constraints and geometric relationships among prototypes are encoded during learning, and 2) an improved graph search criterion, used during classification. These architectural insights are applicable to a wide range of problem domains.


A Study of Parallel Perturbative Gradient Descent

Neural Information Processing Systems

Motivated by difficulties in analog VLSI implementation of back-propagation [Rumelhart et al., 1986] and related algorithms that calculate gradients based on detailed knowledge of the neural network model, several recent papers have proposed a parallel [Alspector et al., 1993, Cauwenberghs, 1993, Kirk et al., 1993] or a semi-parallel [Flower and Jabri, 1993] perturbative technique which has the property that it measures (with the physical neural network) rather than calculates the gradient. This technique is closely related to methods of stochastic approximation [Kushner and Clark, 1978] which have been investigated recently by workers in fields other than neural networks.
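
A minimal software sketch of the parallel perturbative idea follows: all weights are perturbed at once by random signed amounts, the resulting change in error is measured, and that single measurement yields a gradient estimate for every weight. The function name, the step sizes, and the use of a plain Python function in place of a physical network are assumptions for illustration only.

```python
import random


def perturbative_step(error_fn, weights, delta=1e-3, lr=0.1, rng=random):
    """One update that measures, rather than calculates, the gradient.

    error_fn stands in for an error measurement on the physical network.
    """
    # Perturb every weight simultaneously by +delta or -delta.
    signs = [rng.choice((-1.0, 1.0)) for _ in weights]
    perturbed = [w + delta * s for w, s in zip(weights, signs)]
    # Two error measurements; no knowledge of the network model is used.
    diff = error_fn(perturbed) - error_fn(weights)
    # Per-weight estimate diff / (delta * sign); it is noisy, but for small
    # delta it equals the true gradient on average.
    return [w - lr * diff / (delta * s) for w, s in zip(weights, signs)]
```

Each step costs two error evaluations regardless of the number of weights, at the price of a noisy gradient estimate that averages out over many steps.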


Spatial Representations in the Parietal Cortex May Use Basis Functions

Neural Information Processing Systems

The parietal cortex is thought to represent the egocentric positions of objects in particular coordinate systems. We propose an alternative approach to spatial perception of objects in the parietal cortex from the perspective of sensorimotor transformations. The responses of single parietal neurons can be modeled as a Gaussian function of retinal position multiplied by a sigmoid function of eye position, which form a set of basis functions. We show here how these basis functions can be used to generate receptive fields in either retinotopic or head-centered coordinates by simple linear transformations. This raises the possibility that the parietal cortex does not attempt to compute the positions of objects in a particular frame of reference but instead computes a general purpose representation of the retinal location and eye position from which any transformation can be synthesized by direct projection. This representation predicts that hemineglect, a neurological syndrome produced by parietal lesions, should not be confined to egocentric coordinates, but should be observed in multiple frames of reference in single patients, a prediction supported by several experiments.
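
The unit model described above, a Gaussian of retinal position multiplied by a sigmoid of eye position, together with a linear readout over such units, might be sketched as follows. All parameter names and default values (the tuning width, the sigmoid slope, one-dimensional positions in degrees) are assumptions chosen for illustration and are not taken from the paper.

```python
import math


def basis_response(retinal_pos, eye_pos, retinal_center, eye_threshold,
                   sigma=10.0, slope=0.1):
    """Response of one model unit: Gaussian(retinal) * sigmoid(eye)."""
    gauss = math.exp(-(retinal_pos - retinal_center) ** 2 / (2.0 * sigma ** 2))
    sigmoid = 1.0 / (1.0 + math.exp(-slope * (eye_pos - eye_threshold)))
    return gauss * sigmoid


def linear_readout(weights, units, retinal_pos, eye_pos):
    """Weighted sum of basis units.

    units is a list of (retinal_center, eye_threshold) pairs. Different weight
    vectors over the same units would give different downstream receptive fields.
    """
    return sum(w * basis_response(retinal_pos, eye_pos, center, threshold)
               for w, (center, threshold) in zip(weights, units))
```

The point of the sketch is that only the readout weights change between coordinate frames; the population of basis units stays the same.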


An Alternative Model for Mixtures of Experts

Neural Information Processing Systems

We propose an alternative model for mixtures of experts which uses a different parametric form for the gating network. The modified model is trained by the EM algorithm. In comparison with earlier models (trained by either EM or gradient ascent), there is no need to select a learning stepsize. We report simulation experiments which show that the new architecture yields faster convergence. We also apply the new model to two problem domains: piecewise nonlinear function approximation and the combination of multiple previously trained classifiers.

1 INTRODUCTION

For the mixtures of experts architecture (Jacobs, Jordan, Nowlan & Hinton, 1991), the EM algorithm decouples the learning process in a manner that fits well with the modular structure and yields a considerably improved rate of convergence (Jordan & Jacobs, 1994).


Dynamic Cell Structures

Neural Information Processing Systems

Dynamic Cell Structures (DCS) represent a family of artificial neural architectures suited both for unsupervised and supervised learning. They belong to the recently introduced class of Topology Representing Networks (TRN) [Martinetz94], which build perfectly topology preserving feature maps. DCS employ a modified Kohonen learning rule in conjunction with competitive Hebbian learning. The Kohonen type learning rule serves to adjust the synaptic weight vectors while Hebbian learning establishes a dynamic lateral connection structure between the units reflecting the topology of the feature manifold.


A Comparison of Discrete-Time Operator Models for Nonlinear System Identification

Neural Information Processing Systems

We present a unifying view of discrete-time operator models used in the context of finite word length linear signal processing. Comparisons are made between the recently presented gamma operator model, and the delta and rho operator models for performing nonlinear system identification and prediction using neural networks. A new model based on an adaptive bilinear transformation which generalizes all of the above models is presented.


Learning Prototype Models for Tangent Distance

Neural Information Processing Systems

Local algorithms such as K-nearest neighbor (NN) perform well in pattern recognition, even though they often assume the simplest distance on the pattern space. It has recently been shown (Simard et al. 1993) that the performance can be further improved by incorporating invariance to specific transformations in the underlying distance metric, the so-called tangent distance. The resulting classifier, however, can be prohibitively slow and memory intensive due to the large number of prototypes that need to be stored and used in the distance comparisons. In this paper we address this problem for the tangent distance algorithm by developing rich models for representing large subsets of the prototypes. Our leading example of a prototype model is a low-dimensional (12) hyperplane defined by a point and a set of basis or tangent vectors.
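
To make the prototype model concrete, the sketch below computes the Euclidean distance from a pattern to a hyperplane given by a point and a set of tangent vectors. It covers only this one-sided geometric distance, not the paper's learning procedure or the two-sided tangent distance; the function names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def hyperplane_distance(pattern, point, tangents):
    """Distance from pattern to the set {point + sum_i a_i * tangents[i]}."""
    # Orthonormalise the tangent vectors (Gram-Schmidt), skipping any vector
    # that is numerically dependent on the ones already kept.
    basis = []
    for t in tangents:
        v = list(t)
        for b in basis:
            c = dot(v, b)
            v = [vi - c * bi for vi, bi in zip(v, b)]
        norm = math.sqrt(dot(v, v))
        if norm > 1e-12:
            basis.append([vi / norm for vi in v])
    # Remove from (pattern - point) its projection onto the tangent subspace;
    # the length of what remains is the distance to the hyperplane.
    residual = [xi - pi for xi, pi in zip(pattern, point)]
    for b in basis:
        c = dot(residual, b)
        residual = [ri - c * bi for ri, bi in zip(residual, b)]
    return math.sqrt(dot(residual, residual))
```

A classifier built on such models would assign a pattern to the class of the nearest hyperplane, so one model can stand in for many stored prototypes.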