Goto

Collaborating Authors

 Europe


LSTM can Solve Hard Long Time Lag Problems

Neural Information Processing Systems

Standard recurrent nets cannot deal with long minimal time lags between relevant signals. Several recent NIPS papers propose alternative methods. We first show: problems used to promote various previous algorithms can be solved more quickly by random weight guessing than by the proposed algorithms. We then use LSTM, our own recent algorithm, to solve a hard problem that can neither be quickly solved by random search nor by any other recurrent net algorithm we are aware of.


Balancing Between Bagging and Bumping

Neural Information Processing Systems

We compare different methods to combine predictions from neural networks trained on different bootstrap samples of a regression problem. One of these methods, introduced in [6] and which we here call balancing, is based on the analysis of the ensemble generalization error into an ambiguity term and a term incorporating generalization performances of individual networks. We show how to estimate these individual errors from the residuals on validation patterns. Weighting factors for the different networks follow from a quadratic programming problem. On a real-world problem concerning the prediction of sales figures and on the well-known Boston housing data set, balancing clearly outperforms other recently proposed alternatives as bagging [1] and bumping [8]. 1 EARLY STOPPING AND BOOTSTRAPPING Stopped training is a popular strategy to prevent overfitting in neural networks.


Adaptively Growing Hierarchical Mixtures of Experts

Neural Information Processing Systems

We propose a novel approach to automatically growing and pruning Hierarchical Mixtures of Experts. The constructive algorithm proposed here enables large hierarchies consisting of several hundred experts to be trained effectively. We show that HME's trained by our automatic growing procedure yield better generalization performance than traditional static and balanced hierarchies. Evaluation of the algorithm is performed (1) on vowel classification and (2) within a hybrid version of the JANUS r9] speech recognition system using a subset of the Switchboard large-vocabulary speaker-independent continuous speech recognition database.


Limitations of Self-organizing Maps for Vector Quantization and Multidimensional Scaling

Neural Information Processing Systems

SaM can be said to do clustering/vector quantization (VQ) and at the same time to preserve the spatial ordering of the input data reflected by an ordering of the code book vectors (cluster centroids) in a one or two dimensional output space, where the latter property is closely related to multidimensional scaling (MDS) in statistics. Although the level of activity and research around the SaM algorithm is quite large (a recent overview by [Kohonen 95] contains more than 1000 citations), only little comparison among the numerous existing variants of the basic approach and also to more traditional statistical techniques of the larger frameworks of VQ and MDS is available. Additionally, there is only little advice in the literature about how to properly use 446 A. Flexer SOM in order to get optimal results in terms of either vector quantization (VQ) or multidimensional scaling or maybe even both of them. To make the notion of SOM being a tool for "data visualization" more precise, the following question has to be answered: Should SOM be used for doing VQ, MDS, both at the same time or none of them? Two recent comprehensive studies comparing SOM either to traditional VQ or MDS techniques separately seem to indicate that SOM is not competitive when used for either VQ or MDS: [Balakrishnan et al. 94J compare SOM to K-means clustering on 108 multivariate normal clustering problems with known clustering solutions and show that SOM performs significantly worse in terms of data points misclassified


On a Modification to the Mean Field EM Algorithm in Factorial Learning

Neural Information Processing Systems

A modification is described to the use of mean field approximations in the E step of EM algorithms for analysing data from latent structure models, as described by Ghahramani (1995), among others. The modification involves second-order Taylor approximations to expectations computed in the E step. The potential benefits of the method are illustrated using very simple latent profile models.


Estimating Equivalent Kernels for Neural Networks: A Data Perturbation Approach

Neural Information Processing Systems

The perturbation method which we have presented overcomes the limitations of standard approaches, which are only appropriate for models with a single layer of adjustable weights, albeit at considerable computational expense. It has the added bonus of automatically taking into account the effect of regularisation techniques such as weight decay. The experimental results illustrate the application of the technique to two simple problems. As expected the number of degrees of freedom in the models is found to be related to the amount of weight decay used during training. The equivalent kernels are found to vary significantly in different regions of input space and the functions reconstructed from the estimated smoother matrices closely match the origna!


Improving the Accuracy and Speed of Support Vector Machines

Neural Information Processing Systems

Support Vector Learning Machines (SVM) are finding application in pattern recognition, regression estimation, and operator inversion for ill-posed problems. Against this very general backdrop, any methods for improving the generalization performance, or for improving the speed in test phase, of SVMs are of increasing interest. In this paper we combine two such techniques on a pattern recognition problem. The method for improving generalization performance (the "virtual support vector" method) does so by incorporating known invariances of the problem. This method achieves a drop in the error rate on 10,000 NIST test digit images of 1.4% to 1.0%.


The CONDENSATION Algorithm - Conditional Density Propagation and Applications to Visual Tracking

Neural Information Processing Systems

The power of sampling methods in Bayesian reconstruction of noisy signals is well known. The extension of sampling to temporal problems is discussed. Efficacy of sampling over time is demonstrated with visual tracking.


Regression with Input-Dependent Noise: A Bayesian Treatment

Neural Information Processing Systems

In most treatments of the regression problem it is assumed that the distribution of target data can be described by a deterministic function of the inputs, together with additive Gaussian noise having constant variance. The use of maximum likelihood to train such models then corresponds to the minimization of a sum-of-squares error function. In many applications a more realistic model would allow the noise variance itself to depend on the input variables. However, the use of maximum likelihood to train such models would give highly biased results. In this paper we show how a Bayesian treatment can allow for an input-dependent variance while overcoming the bias of maximum likelihood.


Gaussian Processes for Bayesian Classification via Hybrid Monte Carlo

Neural Information Processing Systems

The full Bayesian method for applying neural networks to a prediction problem is to set up the prior/hyperprior structure for the net and then perform the necessary integrals. However, these integrals are not tractable analytically, and Markov Chain Monte Carlo (MCMC) methods are slow, especially if the parameter space is high-dimensional. Using Gaussian processes we can approximate the weight space integral analytically, so that only a small number of hyperparameters need be integrated over by MCMC methods. We have applied this idea to classification problems, obtaining excellent results on the real-world problems investigated so far. 1 INTRODUCTION To make predictions based on a set of training data, fundamentally we need to combine our prior beliefs about possible predictive functions with the data at hand. In the Bayesian approach to neural networks a prior on the weights in the net induces a prior distribution over functions.