Effects of Noise on Convergence and Generalization in Recurrent Networks

Neural Information Processing Systems

We introduce and study methods of inserting synaptic noise into dynamically-driven recurrent neural networks and show that applying a controlled amount of noise during training may improve convergence and generalization. In addition, we analyze the effects of each noise parameter (additive vs. multiplicative, cumulative vs. non-cumulative, per time step vs. per string) and predict that best overall performance can be achieved by injecting additive noise at each time step. Extensive simulations on learning the dual parity grammar from temporal strings substantiate these predictions.
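As a concrete illustration, here is a minimal sketch (not the authors' code; names such as rnn_step and noise_std are assumptions) of the configuration the analysis favors: additive, non-cumulative synaptic noise injected at each time step of a simple recurrent network, and switched off at test time.

```python
import numpy as np

rng = np.random.default_rng(0)

def rnn_step(W_rec, W_in, h, x, noise_std=0.0):
    # Non-cumulative: a fresh noise sample perturbs the weights on each
    # call and is then discarded; the stored weights are never changed.
    W_noisy = W_rec + rng.normal(0.0, noise_std, size=W_rec.shape)
    return np.tanh(W_noisy @ h + W_in @ x)

def run_string(W_rec, W_in, h0, xs, noise_std=0.0):
    # Per-time-step injection: noise is redrawn at every symbol.
    h = h0
    for x in xs:
        h = rnn_step(W_rec, W_in, h, x, noise_std)
    return h
```

During training one would pass a small noise_std (for example 0.05) and set it back to zero for evaluation, so the learned weights themselves are never perturbed permanently.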


Learning Many Related Tasks at the Same Time with Backpropagation

Neural Information Processing Systems

Hinton [6] proposed that generalization in artificial neural nets should improve if nets learn to represent the domain's underlying regularities. Abu-Mostafa's work on hints [1] shows that the outputs of a backprop net can be used as inputs through which domain-specific information can be given to the net. We extend these ideas by showing that a backprop net learning many related tasks at the same time can use these tasks as inductive bias for each other and thus learn better. We identify five mechanisms by which multitask backprop improves generalization and give empirical evidence that multitask backprop generalizes better in real domains.
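A minimal sketch of the shared-representation idea, assuming a single hidden layer feeding one output head per task (sizes and names are illustrative, not the paper's setup): the error signals from all tasks flow back into the same hidden weights, which is how the tasks bias each other.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_tasks = 10, 8, 4
W1 = rng.normal(0.0, 0.1, (n_hidden, n_in))    # shared hidden layer
W2 = rng.normal(0.0, 0.1, (n_tasks, n_hidden)) # one output head per task

def forward(x):
    h = np.tanh(W1 @ x)
    return h, W2 @ h                           # all task outputs at once

def multitask_grads(x, y):
    # Squared-error loss summed over tasks for one example; every task's
    # error signal reaches the shared weights W1, which is the bias effect.
    h, y_hat = forward(x)
    err = y_hat - y
    gW2 = np.outer(err, h)
    gW1 = np.outer((W2.T @ err) * (1.0 - h**2), x)
    return gW1, gW2
```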


A Rapid Graph-based Method for Arbitrary Transformation-Invariant Pattern Classification

Neural Information Processing Systems

We present a graph-based method for rapid, accurate search through prototypes for transformation-invariant pattern classification. Our method has in theory the same recognition accuracy as other recent methods based on "tangent distance" [Simard et al., 1994], since it uses the same categorization rule. Nevertheless, ours is significantly faster during classification because far fewer tangent distances need be computed. Crucial to the success of our system are 1) a novel graph architecture in which transformation constraints and geometric relationships among prototypes are encoded during learning, and 2) an improved graph search criterion, used during classification. These architectural insights are applicable to a wide range of problem domains.
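For reference, a minimal sketch of the one-sided tangent distance of Simard et al. that underlies the shared categorization rule; the graph architecture and search criterion themselves are not reproduced here, and all names are illustrative.

```python
import numpy as np

def tangent_distance(x, prototype, tangents):
    # tangents: (k, d) array of transformation directions (e.g. small
    # rotations, translations) evaluated at the prototype.
    T = np.atleast_2d(tangents)
    # Least-squares projection of (x - prototype) onto the tangent plane.
    a, *_ = np.linalg.lstsq(T.T, x - prototype, rcond=None)
    residual = x - prototype - T.T @ a
    return np.linalg.norm(residual)
```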



An experimental comparison of recurrent neural networks

Neural Information Processing Systems

Many different discrete-time recurrent neural network architectures have been proposed. However, there has been virtually no effort to compare these architectures experimentally. In this paper we review and categorize many of these architectures and compare how they perform on various classes of simple problems including grammatical inference and nonlinear system identification.
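To make the categorization concrete, a hedged sketch contrasting two of the architecture classes typically compared in this setting: a first-order (Elman-style) update and a second-order update of the kind often used for grammatical inference. Names are illustrative, not drawn from the paper.

```python
import numpy as np

def first_order_step(W_rec, W_in, h, x):
    # Elman-style simple recurrent network update.
    return np.tanh(W_rec @ h + W_in @ x)

def second_order_step(W, h, x):
    # W: (n_hidden, n_hidden, n_in); weights on products h[j] * x[k],
    # the form often used for learning finite-state grammars.
    return np.tanh(np.einsum('ijk,j,k->i', W, h, x))
```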


Estimating Conditional Probability Densities for Periodic Variables

Neural Information Processing Systems

In this paper we introduce three novel techniques for estimating conditional probability densities of periodic variables, and investigate their performance using synthetic data. We then apply these techniques to the problem of extracting the distribution of wind vector directions from radar scatterometer data gathered by a remote-sensing satellite.
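The abstract does not name the three techniques, so the following is a hedged illustration only: one natural building block for a density over an angle theta in [-pi, pi) is a mixture of von Mises (circular normal) kernels whose parameters a network would predict from the conditioning input.

```python
import numpy as np
from scipy.special import i0  # modified Bessel function of order zero

def von_mises_pdf(theta, mu, kappa):
    # Circular normal density on [-pi, pi) with mean direction mu
    # and concentration kappa.
    return np.exp(kappa * np.cos(theta - mu)) / (2.0 * np.pi * i0(kappa))

def mixture_pdf(theta, weights, mus, kappas):
    # p(theta | x): weights, mus and kappas would be the outputs of a
    # network evaluated on the conditioning input x (not shown here).
    return sum(w * von_mises_pdf(theta, m, k)
               for w, m, k in zip(weights, mus, kappas))
```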


Active Learning with Statistical Models

Neural Information Processing Systems

For many types of learners one can compute the statistically "optimal" way to select data. We review how these techniques have been used with feedforward neural networks [MacKay, 1992; Cohn, 1994]. We then show how the same principles may be used to select data for two alternative, statistically-based learning architectures: mixtures of Gaussians and locally weighted regression. While the techniques for neural networks are expensive and approximate, the techniques for mixtures of Gaussians and locally weighted regression are both efficient and accurate.
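As a hedged, simplified sketch in the spirit of this kind of data selection (not the paper's exact criterion, which minimizes expected predictive variance): fit locally weighted regression and query where a local variance estimate is largest. All names and the variance proxy are illustrative assumptions.

```python
import numpy as np

def lwr_fit(xq, X, y, bandwidth=0.3):
    # Locally weighted linear regression at scalar query xq; returns the
    # prediction and a crude local variance estimate from the weighted
    # residuals (a stand-in for a full predictive-variance computation).
    w = np.exp(-0.5 * ((X - xq) / bandwidth) ** 2)
    A = np.stack([np.ones_like(X), X], axis=1)
    WA = A * w[:, None]
    beta = np.linalg.solve(A.T @ WA + 1e-8 * np.eye(2), WA.T @ y)
    resid = y - A @ beta
    s2 = np.sum(w * resid**2) / max(np.sum(w), 1e-8)
    return np.array([1.0, xq]) @ beta, s2

def pick_query(candidates, X, y):
    # Query where the local variance estimate is largest.
    return max(candidates, key=lambda xq: lwr_fit(xq, X, y)[1])
```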


Non-linear Prediction of Acoustic Vectors Using Hierarchical Mixtures of Experts

Neural Information Processing Systems

We are concerned in this paper with the application of multiple models, specifically the Hierarchical Mixtures of Experts, to time series prediction, in particular the problem of predicting acoustic vectors for use in speech coding. There have been a number of applications of multiple models in time series prediction. A classic example is the Threshold Autoregressive model (TAR), which Tong & Lim (1980) used to predict sunspot activity. More recently, Lewis, Kay and Stevens (in Weigend & Gershenfeld (1994)) describe the application of Multivariate Adaptive Regression Splines (MARS) to the prediction of future values of currency exchange rates. Finally, in speech prediction, Cuperman & Gersho (1985) describe the Switched Inter-frame Vector Prediction (SIVP) method, which switches between separate linear predictors trained on different statistical classes of speech.
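A minimal single-level sketch of the mixture-of-experts prediction step (the hierarchical version nests this gating recursively; names are illustrative): a softmax gate blends several linear experts, all conditioned on the lagged input.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def moe_predict(x, gate_W, expert_Ws):
    # x: lagged input vector; gate_W: (n_experts, d_in);
    # expert_Ws: (n_experts, d_out, d_in) linear predictors.
    g = softmax(gate_W @ x)                     # mixing proportions
    preds = np.array([W @ x for W in expert_Ws])
    return np.tensordot(g, preds, axes=1)       # gated blend of experts
```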


Visual Speech Recognition with Stochastic Networks

Neural Information Processing Systems

This paper presents ongoing work on a speaker independent visual speech recognition system. The work presented here builds on previous research efforts in this area and explores the potential use of simple hidden Markov models for limited vocabulary, speaker independent visual speech recognition. The task at hand is recognition of the first four English digits, a task with possible applications in car-phone dialing. The images were modeled as mixtures of independent Gaussian distributions, and the temporal dependencies were captured with standard left-to-right hidden Markov models. The results indicate that simple hidden Markov models may be used to successfully recognize relatively unprocessed image sequences.
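A minimal sketch of the scoring step such a system needs, assuming the per-frame emission log-likelihoods from the per-state Gaussian mixtures are already computed: the forward algorithm for a left-to-right HMM. Names are illustrative, not the authors' code.

```python
import numpy as np
from scipy.special import logsumexp

def forward_loglik(log_A, log_pi, log_B):
    # log_A: (S, S) transition log-probs (left-to-right: entries below
    # the diagonal are -inf); log_pi: (S,) initial log-probs;
    # log_B: (T, S) per-frame emission log-likelihoods from the
    # per-state Gaussian mixtures.
    alpha = log_pi + log_B[0]
    for t in range(1, len(log_B)):
        alpha = log_B[t] + logsumexp(alpha[:, None] + log_A, axis=0)
    return logsumexp(alpha)  # log p(image sequence | word model)
```

Recognition of the four digits then amounts to evaluating this score under each digit's word model and picking the largest.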


Hierarchical Mixtures of Experts Methodology Applied to Continuous Speech Recognition

Neural Information Processing Systems

In this paper, we incorporate the Hierarchical Mixtures of Experts (HME) method of probability estimation, developed by Jordan [1], into an HMM-based continuous speech recognition system. The resulting system can be thought of as a continuous-density HMM system, but instead of using Gaussian mixtures, the HME system employs a large set of hierarchically organized but relatively small neural networks to perform the probability density estimation. The hierarchical structure is reminiscent of a decision tree except for two important differences: each "expert" or neural net performs a "soft" decision rather than a hard decision, and, unlike ordinary decision trees, the parameters of all the neural nets in the HME are automatically trainable using the EM algorithm. We report results on the ARPA 5,000-word and 40,000-word Wall Street Journal corpus using HME models. Recent research has shown that a continuous-density HMM (CD-HMM) system can outperform a more constrained tied-mixture HMM system for large-vocabulary continuous speech recognition (CSR) when a large amount of training data is available [2]. In other work, the utility of decision trees has been demonstrated in classification problems by using the "divide and conquer" paradigm effectively, where a problem is divided into a hierarchical set of simpler problems.
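A minimal sketch of the "soft decision tree" structure described above, for a two-level HME with illustrative names: gates make soft splits, small expert nets supply leaf posteriors, and the blended posteriors would then be converted to scaled likelihoods for the HMM.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def hme_posterior(x, top_gate, sub_gates, experts):
    # top_gate: (b, d); sub_gates: (b, b, d); experts: (b, b, c, d),
    # where b is the branching factor and c the number of classes.
    g_top = softmax(top_gate @ x)              # soft split at the root
    p = np.zeros(experts.shape[2])
    for i, g_i in enumerate(g_top):
        g_sub = softmax(sub_gates[i] @ x)      # soft split at level two
        for j, g_ij in enumerate(g_sub):
            p += g_i * g_ij * softmax(experts[i, j] @ x)
    return p                                   # blended class posteriors
```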