 Deep Learning



Does the Wake-sleep Algorithm Produce Good Density Estimators?

Neural Information Processing Systems

The wake-sleep algorithm (Hinton, Dayan, Frey and Neal 1995) is a relatively efficient method of fitting a multilayer stochastic generative model to high-dimensional data. In addition to the top-down connections in the generative model, it makes use of bottom-up connections for approximating the probability distribution over the hidden units given the data, and it trains these bottom-up connections using a simple delta rule. We use a variety of synthetic and real data sets to compare the performance of the wake-sleep algorithm with Monte Carlo and mean field methods for fitting the same generative model and also compare it with other models that are less powerful but easier to fit.

1 INTRODUCTION

Neural networks are often used as bottom-up recognition devices that transform input vectors into representations of those vectors in one or more hidden layers. But multilayer networks of stochastic neurons can also be used as top-down generative models that produce patterns with complicated correlational structure in the bottom visible layer. In this paper we consider generative models composed of layers of stochastic binary logistic units. Given a generative model parameterized by top-down weights, there is an obvious way to perform unsupervised learning. The generative weights are adjusted to maximize the probability that the visible vectors generated by the model would match the observed data.
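The wake-sleep abstract above mentions top-down generative weights, bottom-up recognition weights, and a simple delta rule. A minimal sketch of one training step is given here, assuming a single hidden layer of stochastic binary logistic units. All names (`wake_sleep_step`, `gen_w`, `rec_w`, `prior_b`, and so on) are hypothetical and chosen for illustration; this is not the experimental setup of the paper.

```python
import math
import random


def sigmoid(x):
    # Numerically safe logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sample(probs, rng):
    # Draw one binary state per unit from its firing probability.
    return [1 if rng.random() < p else 0 for p in probs]


def layer_probs(weights, bias, inputs):
    # weights[j][i] connects input unit i to output unit j.
    return [sigmoid(b + sum(w * x for w, x in zip(row, inputs)))
            for row, b in zip(weights, bias)]


def delta_update(weights, bias, inputs, targets, probs, lr):
    # Delta rule: move each unit's probability toward its target state.
    for j, row in enumerate(weights):
        err = targets[j] - probs[j]
        bias[j] += lr * err
        for i in range(len(row)):
            row[i] += lr * err * inputs[i]


def wake_sleep_step(v, gen_w, gen_b, prior_b, rec_w, rec_b, lr, rng):
    # Wake phase: infer hidden states bottom-up from a data vector,
    # then train the top-down (generative) weights to reconstruct it.
    h = sample(layer_probs(rec_w, rec_b, v), rng)
    delta_update(gen_w, gen_b, h, v, layer_probs(gen_w, gen_b, h), lr)
    for j in range(len(prior_b)):
        prior_b[j] += lr * (h[j] - sigmoid(prior_b[j]))

    # Sleep phase: generate a fantasy top-down, then train the
    # bottom-up (recognition) weights to recover its hidden states.
    h_f = sample([sigmoid(b) for b in prior_b], rng)
    v_f = sample(layer_probs(gen_w, gen_b, h_f), rng)
    delta_update(rec_w, rec_b, v_f, h_f, layer_probs(rec_w, rec_b, v_f), lr)
```

In this sketch the wake phase only changes generative parameters and the sleep phase only changes recognition parameters, which is what keeps each update a local delta rule.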


Learning long-term dependencies is not as difficult with NARX networks

Neural Information Processing Systems

It has recently been shown that gradient descent learning algorithms for recurrent neural networks can perform poorly on tasks that involve long-term dependencies. In this paper we explore this problem for a class of architectures called NARX networks, which have powerful representational capabilities. Previous work reported that gradient descent learning is more effective in NARX networks than in recurrent networks with "hidden states". We show that although NARX networks do not circumvent the problem of long-term dependencies, they can greatly improve performance on such problems. We present some experimental results that show that NARX networks can often retain information for two to three times as long as conventional recurrent networks.
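The abstract does not spell out the NARX recurrence. As a rough sketch, assuming the usual form in which the output depends on delayed inputs and delayed outputs, y(t) = f(u(t), ..., u(t - n_u), y(t - 1), ..., y(t - n_y)), the model can be run as below. The names `narx_run` and `make_mlp`, and the zero padding before the start of the sequence, are illustrative assumptions.

```python
import math


def narx_run(inputs, f, n_u, n_y):
    """Run y(t) = f(u(t), ..., u(t-n_u), y(t-1), ..., y(t-n_y)).

    Values before the start of the sequence are taken to be zero.
    """
    outputs = []
    for t in range(len(inputs)):
        # Tapped delay line on the input.
        u_taps = [inputs[t - k] if t - k >= 0 else 0.0
                  for k in range(n_u + 1)]
        # Tapped delay line on the model's own past outputs.
        y_taps = [outputs[t - k] if t - k >= 0 else 0.0
                  for k in range(1, n_y + 1)]
        outputs.append(f(u_taps + y_taps))
    return outputs


def make_mlp(w_hidden, w_out):
    # A small tanh network used as the nonlinear map f (no biases).
    def f(x):
        hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)))
                  for row in w_hidden]
        return sum(w * h for w, h in zip(w_out, hidden))
    return f
```

The only state carried forward is the window of past outputs, so the output delays give gradients a shorter path back through time than a chain of single-step hidden-state updates.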



A Smoothing Regularizer for Recurrent Neural Networks

Neural Information Processing Systems

We derive a smoothing regularizer for recurrent network models by requiring robustness in prediction performance to perturbations of the training data. The regularizer can be viewed as a generalization of the first order Tikhonov stabilizer to dynamic models. The closed-form expression of the regularizer covers both time-lagged and simultaneous recurrent nets, with feedforward nets and one-layer linear nets as special cases. We have successfully tested this regularizer in a number of case studies and found that it performs better than standard quadratic weight decay.

1 Introduction

One technique for preventing a neural network from overfitting noisy data is to add a regularizer to the error function being minimized. Regularizers typically smooth the fit to noisy data. Well-established techniques include ridge regression, see (Hoerl & Kennard 1970), and more generally spline smoothing functions or Tikhonov stabilizers that penalize the mth-order squared derivatives of the function being fit, as in (Tikhonov & Arsenin 1977), (Eubank 1988), (Hastie & Tibshirani 1990) and (Wahba 1990). These methods have recently been extended to networks of radial basis functions (Girosi, Jones & Poggio 1995), and several heuristic approaches have been developed for sigmoidal neural networks, for example, quadratic weight decay (Plaut, Nowlan & Hinton 1986), weight elimination (Scalettar & Zee 1988), (Chauvin 1990), (Weigend, Rumelhart & Huberman 1990) and soft weight sharing (Nowlan & Hinton 1992).
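To make "adding a regularizer to the error function" concrete, here is a small sketch of the baseline the abstract compares against, quadratic weight decay, applied to a one-weight linear model. It is not the smoothing regularizer derived in the paper. The function name `fit_linear` and its parameters are hypothetical; the objective assumed is the mean squared error plus `weight_decay * w**2`.

```python
def fit_linear(xs, ys, weight_decay=0.0, lr=0.05, steps=2000):
    """Fit y = w * x by gradient descent on mean((w*x - y)^2) + weight_decay * w^2."""
    w = 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradient of the data-fit term.
        grad = sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        # Gradient of the quadratic penalty, which pulls w toward zero.
        grad += 2.0 * weight_decay * w
        w -= lr * grad
    return w
```

For this objective the minimizer is sum(x*y) / (sum(x*x) + n * weight_decay), so a larger penalty shrinks the fitted weight, which is the one-dimensional case of ridge regression.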


Recurrent Neural Networks for Missing or Asynchronous Data

Neural Information Processing Systems

In this paper we propose recurrent neural networks with feedback into the input units for handling two types of data analysis problems. On the one hand, this scheme can be used for static data when some of the input variables are missing. On the other hand, it can also be used for sequential data, when some of the input variables are missing or are available at different frequencies.


Modern Analytic Techniques to Solve the Dynamics of Recurrent Neural Networks

Neural Information Processing Systems

We describe the use of modern analytical techniques in solving the dynamics of symmetric and nonsymmetric recurrent neural networks near saturation. These explicitly take into account the correlations between the post-synaptic potentials, and thereby allow for a reliable prediction of transients.

1 INTRODUCTION

Recurrent neural networks have been rather popular in the physics community, because they lend themselves so naturally to analysis with tools from equilibrium statistical mechanics. This was the main theme of physicists between, say, 1985 and 1990. Less familiar to the neural network community is a subsequent wave of theoretical physical studies, dealing with the dynamics of symmetric and nonsymmetric recurrent networks. The strategy here is to try to describe the processes at a reduced level of an appropriate small set of dynamic macroscopic observables.



Forward-backward retraining of recurrent neural networks

Neural Information Processing Systems

This paper describes the training of a recurrent neural network as the letter posterior probability estimator for a hidden Markov model, off-line handwriting recognition system. The network estimates posterior distributions for each of a series of frames representing sections of a handwritten word. The supervised training algorithm, backpropagation through time, requires target outputs to be provided for each frame. Three methods for deriving these targets are presented. A novel method based upon the forward-backward algorithm is found to result in the recognizer with the lowest error rate.

1 Introduction

In the field of off-line handwriting recognition, the goal is to read a handwritten document and produce a machine transcription.
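The abstract does not give the details of how targets are derived, so the following is only a generic sketch of the forward-backward algorithm it names: given an HMM and per-frame state scores, it returns a posterior distribution over states for every frame, which is the kind of soft per-frame target such a method could provide. The function name and the arguments `init`, `trans`, and `emit_probs` are assumptions for illustration, and no scaling is applied, so it is suited to short sequences only.

```python
def forward_backward(init, trans, emit_probs):
    """Return gamma[t][s], the posterior probability of state s at frame t.

    init[s]          : probability of starting in state s
    trans[r][s]      : probability of moving from state r to state s
    emit_probs[t][s] : likelihood of frame t under state s
    """
    T, S = len(emit_probs), len(init)
    alpha = [[0.0] * S for _ in range(T)]
    beta = [[1.0] * S for _ in range(T)]

    # Forward pass: joint probability of the frames so far and the state.
    for s in range(S):
        alpha[0][s] = init[s] * emit_probs[0][s]
    for t in range(1, T):
        for s in range(S):
            alpha[t][s] = emit_probs[t][s] * sum(
                alpha[t - 1][r] * trans[r][s] for r in range(S))

    # Backward pass: probability of the remaining frames given the state.
    for t in range(T - 2, -1, -1):
        for r in range(S):
            beta[t][r] = sum(
                trans[r][s] * emit_probs[t + 1][s] * beta[t + 1][s]
                for s in range(S))

    # Combine and normalize per frame.
    gamma = []
    for t in range(T):
        row = [alpha[t][s] * beta[t][s] for s in range(S)]
        z = sum(row)
        gamma.append([x / z for x in row])
    return gamma
```

Unlike a hard alignment, which would assign each frame to exactly one state, these posteriors spread probability across states wherever the segmentation is ambiguous.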

