 Bayesian Learning


Gaussian Processes for Bayesian Classification via Hybrid Monte Carlo

Neural Information Processing Systems

The full Bayesian method for applying neural networks to a prediction problem is to set up the prior/hyperprior structure for the net and then perform the necessary integrals. However, these integrals are not tractable analytically, and Markov Chain Monte Carlo (MCMC) methods are slow, especially if the parameter space is high-dimensional. Using Gaussian processes we can approximate the weight space integral analytically, so that only a small number of hyperparameters need be integrated over by MCMC methods. We have applied this idea to classification problems, obtaining excellent results on the real-world problems investigated so far.

1 INTRODUCTION

To make predictions based on a set of training data, fundamentally we need to combine our prior beliefs about possible predictive functions with the data at hand. In the Bayesian approach to neural networks a prior on the weights in the net induces a prior distribution over functions.


Bayesian Model Comparison by Monte Carlo Chaining

Neural Information Processing Systems

Neural Computing Research Group, Aston University, Birmingham, B4 7ET, U.K. http://www.ncrg.aston.ac.uk/

Abstract

The techniques of Bayesian inference have been applied with great success to many problems in neural computing including evaluation of regression functions, determination of error bars on predictions, and the treatment of hyper-parameters. However, the problem of model comparison is a much more challenging one for which current techniques have significant limitations. In this paper we show how an extended form of Markov chain Monte Carlo, called chaining, is able to provide effective estimates of the relative probabilities of different models. We present results from the robot arm problem and compare them with the corresponding results obtained using the standard Gaussian approximation framework.

Initially this is chosen to be some prior distribution $p(w \mid M)$, which can be combined with a likelihood function $p(D \mid w, M)$ using Bayes' theorem to give a posterior distribution $p(w \mid D, M)$ in the form

$$p(w \mid D, M) = \frac{p(D \mid w, M)\, p(w \mid M)}{p(D \mid M)} \tag{1}$$

where $D$ is the data set. Predictions of the model are obtained by performing integrations weighted by the posterior distribution.
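As a rough illustration of equation (1), the sketch below evaluates the posterior and the normalising constant p(D|M) on a one-dimensional grid. It is not the chaining method of the paper. The toy model (a standard normal prior on a single weight w and unit-variance Gaussian noise), the data values, and the name `grid_posterior` are all assumptions made for this example.

```python
import numpy as np

def grid_posterior(w_grid, prior, log_likelihood):
    """Apply Bayes' theorem on an evenly spaced grid of weight values.

    Returns the posterior density p(w|D,M) on the grid and a Riemann-sum
    estimate of the evidence p(D|M).
    """
    dw = w_grid[1] - w_grid[0]
    log_joint = np.log(prior) + log_likelihood      # log p(D|w,M) p(w|M)
    shift = log_joint.max()                         # avoid underflow in exp
    unnorm = np.exp(log_joint - shift)
    evidence = unnorm.sum() * dw * np.exp(shift)    # p(D|M)
    posterior = unnorm / (unnorm.sum() * dw)        # integrates to 1 on the grid
    return posterior, evidence

# Hypothetical toy model: w ~ N(0, 1), observations y_i ~ N(w, 1).
data = np.array([0.5, 1.5, 1.0])
w_grid = np.linspace(-10.0, 10.0, 4001)
prior = np.exp(-0.5 * w_grid**2) / np.sqrt(2.0 * np.pi)
log_likelihood = np.sum(
    -0.5 * (data[:, None] - w_grid[None, :])**2 - 0.5 * np.log(2.0 * np.pi),
    axis=0,
)

posterior, evidence = grid_posterior(w_grid, prior, log_likelihood)
dw = w_grid[1] - w_grid[0]
posterior_mean = np.sum(w_grid * posterior) * dw
```

For this conjugate toy model the posterior mean has the closed form sum(y) / (n + 1), which gives a simple check on the grid computation. A grid is only workable in very low dimensions, which is why sampling methods are needed for realistic networks.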


Computing with Infinite Networks

Neural Information Processing Systems

For neural networks with a wide class of weight-priors, it can be shown that in the limit of an infinite number of hidden units the prior over functions tends to a Gaussian process. In this paper analytic forms are derived for the covariance function of the Gaussian processes corresponding to networks with sigmoidal and Gaussian hidden units. This allows predictions to be made efficiently using networks with an infinite number of hidden units, and shows that, somewhat paradoxically, it may be easier to compute with infinite networks than finite ones.

1 Introduction

To someone training a neural network by maximizing the likelihood of a finite amount of data it makes no sense to use a network with an infinite number of hidden units; the network will "overfit" the data and so will be expected to generalize poorly. However, the idea of selecting the network size depending on the amount of training data makes little sense to a Bayesian; a model should be chosen that reflects the understanding of the problem, and then application of Bayes' theorem allows inference to be carried out (at least in theory) after the data is observed.

In the Bayesian treatment of neural networks, a question immediately arises as to how many hidden units are believed to be appropriate for a task. Neal (1996) has argued compellingly that for real-world problems, there is no reason to believe that neural network models should be limited to nets containing only a "small" number of hidden units. He has shown that it is sensible to consider a limit where the number of hidden units in a net tends to infinity, and that good predictions can be obtained from such models using the Bayesian machinery. He has also shown that for fixed hyperparameters, a large class of neural network models will converge to a Gaussian process prior over functions in the limit of an infinite number of hidden units.
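The abstract does not state the covariance functions it derives. As a hedged sketch, the code below implements the arcsine form usually quoted for networks with erf (sigmoidal) hidden units and zero-mean Gaussian input-to-hidden weights, K(x, x') = (2/π) arcsin( 2 x̃ᵀΣx̃' / sqrt((1 + 2 x̃ᵀΣx̃)(1 + 2 x̃'ᵀΣx̃')) ), where x̃ is the input with a bias entry of 1 prepended. The diagonal weight covariance and the function name `erf_network_kernel` are assumptions for this example.

```python
import numpy as np

def erf_network_kernel(X1, X2, sigma_diag):
    """Arcsine covariance for an infinite network with erf hidden units.

    X1, X2     : arrays of shape (n1, d) and (n2, d)
    sigma_diag : length d+1 diagonal of the weight covariance
                 (first entry is the bias variance); assumed diagonal here.
    """
    S = np.asarray(sigma_diag, dtype=float)
    # Augment inputs with a constant 1 for the bias weight.
    X1a = np.hstack([np.ones((X1.shape[0], 1)), X1])
    X2a = np.hstack([np.ones((X2.shape[0], 1)), X2])
    cross = 2.0 * (X1a * S) @ X2a.T                   # 2 x~^T Sigma x~'
    n1 = 1.0 + 2.0 * np.sum(X1a * S * X1a, axis=1)    # 1 + 2 x~^T Sigma x~
    n2 = 1.0 + 2.0 * np.sum(X2a * S * X2a, axis=1)
    ratio = cross / np.sqrt(np.outer(n1, n2))
    # Clip only to guard against rounding just outside [-1, 1].
    return (2.0 / np.pi) * np.arcsin(np.clip(ratio, -1.0, 1.0))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
K = erf_network_kernel(X, X, sigma_diag=[1.0, 2.0, 2.0])
```

A matrix built this way can be used like any other Gaussian process covariance, so predictions cost a linear solve in the number of training points rather than an integral over network weights.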


A Mean Field Algorithm for Bayes Learning in Large Feed-forward Neural Networks

Neural Information Processing Systems

In the Bayes approach to statistical inference [Berger, 1985] one assumes that the prior uncertainty about parameters of an unknown data generating mechanism can be encoded in a probability distribution, the so-called prior. Using the prior and the likelihood of the data given the parameters, the posterior distribution of the parameters can be derived from Bayes' rule. From this posterior, various estimates for functions of the parameter, like predictions about unseen data, can be calculated. However, in general, those predictions cannot be realised by specific parameter values, but only by an ensemble average over parameters according to the posterior probability. Hence, exact implementations of Bayes' method for neural networks require averages over network parameters which in general can be performed by time-consuming Monte Carlo procedures.


An Apobayesian Relative of Winnow

Neural Information Processing Systems

We study a mistake-driven variant of an online Bayesian learning algorithm (similar to one studied by Cesa-Bianchi, Helmbold, and Panizza [CHP96]). This variant only updates its state (learns) on trials in which it makes a mistake. The algorithm makes binary classifications using a linear-threshold classifier and runs in time linear in the number of attributes seen by the learner. We have been able to show, theoretically and in simulations, that this algorithm performs well under assumptions quite different from those embodied in the prior of the original Bayesian algorithm. It can handle situations that we do not know how to handle in linear time with Bayesian algorithms. We expect our techniques to be useful in deriving and analyzing other apobayesian algorithms.

1 Introduction

We consider two styles of online learning.
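To make "mistake-driven" and "linear-threshold" concrete, here is a minimal sketch of Littlestone's Winnow, the algorithm named in the title. It is a stand-in and not the apobayesian algorithm of the paper, whose update rule the abstract does not give. The class name, the threshold n/2, the factor 2, and the toy target (x[0] OR x[1]) are assumptions for this example.

```python
import itertools

class MistakeDrivenWinnow:
    """Winnow-style learner: state changes only on trials with a mistake."""

    def __init__(self, n_attributes, alpha=2.0):
        self.w = [1.0] * n_attributes
        self.theta = n_attributes / 2.0   # fixed decision threshold
        self.alpha = alpha                # multiplicative update factor
        self.mistakes = 0

    def predict(self, x):
        # Linear-threshold rule over the active (nonzero) attributes.
        total = sum(w for w, xi in zip(self.w, x) if xi)
        return 1 if total >= self.theta else 0

    def observe(self, x, y):
        """Predict, then update only if the prediction was wrong."""
        y_hat = self.predict(x)
        if y_hat != y:
            self.mistakes += 1
            # Promote active weights on a missed positive, demote otherwise.
            factor = self.alpha if y == 1 else 1.0 / self.alpha
            self.w = [w * factor if xi else w for w, xi in zip(self.w, x)]
        return y_hat

def target(x):
    # Hypothetical concept to learn: a two-literal disjunction.
    return 1 if (x[0] or x[1]) else 0

examples = list(itertools.product([0, 1], repeat=8))
learner = MistakeDrivenWinnow(n_attributes=8)
for _ in range(30):                       # repeated passes over all inputs
    for x in examples:
        learner.observe(x, target(x))
```

Each trial costs time linear in the number of attributes, and trials that are predicted correctly leave the weights untouched, which is the property the abstract calls mistake-driven.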


Statistically Efficient Estimations Using Cortical Lateral Connections

Neural Information Processing Systems

Coarse codes are widely used throughout the brain to encode sensory and motor variables. Methods designed to interpret these codes, such as population vector analysis, are either inefficient, i.e., the variance of the estimate is much larger than the smallest possible variance, or biologically implausible, like maximum likelihood. Moreover, these methods attempt to compute a scalar or vector estimate of the encoded variable. Neurons are faced with a similar estimation problem. They must read out the responses of the presynaptic neurons, but, by contrast, they typically encode the variable with a further population code rather than as a scalar. We show how a nonlinear recurrent network can be used to perform this estimation in an optimal way while keeping the estimate in a coarse code format. This work suggests that lateral connections in the cortex may be involved in cleaning up uncorrelated noise among neurons representing similar variables.
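For readers unfamiliar with the population vector readout mentioned above, the following sketch decodes an angle from a noisy population code. It illustrates only that baseline estimator, not the recurrent network proposed in the paper. The von Mises-shaped tuning curves, the Poisson noise, and every parameter value are assumptions chosen for this example.

```python
import numpy as np

def tuning(theta, preferred, kappa=2.0, peak=20.0):
    """Mean firing rate of each neuron for stimulus angle theta (assumed shape)."""
    return peak * np.exp(kappa * (np.cos(theta - preferred) - 1.0))

def population_vector_estimate(rates, preferred):
    """Angle of the rate-weighted sum of unit vectors at the preferred angles."""
    z = np.sum(rates * np.exp(1j * preferred))
    return np.angle(z) % (2.0 * np.pi)

# 64 neurons with evenly spaced preferred angles on the circle.
preferred = np.linspace(0.0, 2.0 * np.pi, 64, endpoint=False)
true_theta = 1.3

# Noise-free responses recover the stimulus almost exactly.
clean_estimate = population_vector_estimate(tuning(true_theta, preferred), preferred)

# Poisson spike counts give a noisy scalar estimate.
rng = np.random.default_rng(0)
noisy_rates = rng.poisson(tuning(true_theta, preferred))
noisy_estimate = population_vector_estimate(noisy_rates, preferred)
```

Note that this readout collapses the population response to a single number, which is the contrast the abstract draws with a network that keeps its estimate in coarse-code form.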


Learning Exact Patterns of Quasi-synchronization among Spiking Neurons from Data on Multi-unit Recordings

Neural Information Processing Systems

This paper develops arguments for a family of temporal log-linear models to represent spatiotemporal correlations among the spiking events in a group of neurons. The models can represent not just pairwise correlations but also correlations of higher order. Methods are discussed for inferring the existence or absence of correlations and estimating their strength. A frequentist and a Bayesian approach to correlation detection are compared.


Contour Organisation with the EM Algorithm

Neural Information Processing Systems

This paper describes how the early visual process of contour organisation can be realised using the EM algorithm. The underlying computational representation is based on fine spline coverings. According to our EM approach the adjustment of spline parameters draws on an iterative weighted least-squares fitting process. The expectation step of our EM procedure computes the likelihood of the data using a mixture model defined over the set of spline coverings. These splines are limited in their spatial extent using Gaussian windowing functions.


Compositionality, MDL Priors, and Object Recognition

Neural Information Processing Systems

Images are ambiguous at each of many levels of a contextual hierarchy. Nevertheless, the high-level interpretation of most scenes is unambiguous, as evidenced by the superior performance of humans. This observation argues for global vision models, such as deformable templates. Unfortunately, such models are computationally intractable for unconstrained problems. We propose a compositional model in which primitives are recursively composed, subject to syntactic restrictions, to form tree-structured objects and object groupings. Ambiguity is propagated up the hierarchy in the form of multiple interpretations, which are later resolved by a Bayesian, equivalently minimum-description-length, cost functional.


Maximum Likelihood Blind Source Separation: A Context-Sensitive Generalization of ICA

Neural Information Processing Systems

We cast the problem as one of maximum likelihood density estimation, and in that framework introduce an algorithm that searches for independent components using both temporal and spatial cues. We call the resulting algorithm "Contextual ICA," after the (Bell and Sejnowski 1995) Infomax algorithm, which we show to be a special case of cICA. Because cICA can make use of the temporal structure of its input, it is able to separate in a number of situations where standard methods cannot, including sources with low kurtosis, colored Gaussian sources, and sources which have Gaussian histograms.

1 The Blind Source Separation Problem

Consider a set of n independent sources