A practical method for Bayesian training of feed-forward neural networks using sophisticated Monte Carlo methods is presented and evaluated. In reasonably small amounts of computer time this approach outperforms other state-of-the-art methods on 5 data-limited tasks from real-world domains.

1 INTRODUCTION

Bayesian learning uses a prior on model parameters, combines this with information from a training set, and then integrates over the resulting posterior to make predictions. With this approach, we can use large networks without fear of overfitting, allowing us to capture more structure in the data, thus improving prediction accuracy and eliminating the tedious search (often performed using cross-validation) for the model complexity that optimises the bias/variance tradeoff. In this approach the size of the model is limited only by computational considerations. The application of Bayesian learning to neural networks has been pioneered by MacKay (1992), who uses a Gaussian approximation to the posterior weight distribution.
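The prior/data/posterior-integration recipe described above can be illustrated with a minimal sketch. It is not the Monte Carlo method of the paper: it assumes a toy one-parameter model y = w * x with Gaussian noise, and it approximates the posterior integral by self-normalised importance sampling from the prior. The function names, the data, and the settings `noise_sd`, `prior_sd` and `n_samples` are all hypothetical choices made for illustration.

```python
import math
import random


def log_likelihood(w, xs, ys, noise_sd):
    """Gaussian log likelihood (up to a constant) of the toy model y = w * x."""
    return -sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / (2.0 * noise_sd ** 2)


def posterior_predictive_mean(x_new, xs, ys, noise_sd=0.5, prior_sd=1.0,
                              n_samples=20000, seed=0):
    """Estimate E[w * x_new | data] by drawing w from the prior and weighting
    each draw by its likelihood (self-normalised importance sampling)."""
    rng = random.Random(seed)
    ws = [rng.gauss(0.0, prior_sd) for _ in range(n_samples)]
    log_ls = [log_likelihood(w, xs, ys, noise_sd) for w in ws]
    shift = max(log_ls)  # subtract the maximum before exponentiating, for stability
    weights = [math.exp(l - shift) for l in log_ls]
    total = sum(weights)
    return sum(wt * w * x_new for wt, w in zip(weights, ws)) / total


# Hypothetical training set, roughly following y = 2 * x.
xs, ys = [1.0, 2.0, 3.0], [2.1, 3.9, 6.2]
print(posterior_predictive_mean(4.0, xs, ys))
```

For this conjugate toy model the posterior mean of w can be computed by hand and equals 2.0, so the printed estimate should lie close to 8.0. Sampling from the prior becomes hopelessly inefficient as the number of weights grows, which is one reason more sophisticated Monte Carlo methods are needed for real networks.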
The full Bayesian method for applying neural networks to a prediction problem is to set up the prior/hyperprior structure for the net and then perform the necessary integrals. However, these integrals are not tractable analytically, and Markov Chain Monte Carlo (MCMC) methods are slow, especially if the parameter space is high-dimensional. Using Gaussian processes we can approximate the weight space integral analytically, so that only a small number of hyperparameters need be integrated over by MCMC methods. We have applied this idea to classification problems, obtaining excellent results on the real-world problems investigated so far.

1 INTRODUCTION

To make predictions based on a set of training data, fundamentally we need to combine our prior beliefs about possible predictive functions with the data at hand. In the Bayesian approach to neural networks a prior on the weights in the net induces a prior distribution over functions.
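The statement that a prior on the weights induces a prior over functions can be made concrete with a small sketch. It assumes a one-hidden-layer tanh network with independent zero-mean Gaussian weights; the architecture, the 1/sqrt(n_hidden) scaling of the output weights, and all names are illustrative assumptions rather than details taken from the text. Each call draws one weight vector and returns the corresponding function evaluated at a few inputs.

```python
import math
import random


def sample_network_function(xs, n_hidden=50, weight_sd=1.0, rng=random):
    """Draw one set of weights from a zero-mean Gaussian prior and return the
    outputs of a one-hidden-layer tanh network at the inputs xs."""
    in_w = [rng.gauss(0.0, weight_sd) for _ in range(n_hidden)]
    in_b = [rng.gauss(0.0, weight_sd) for _ in range(n_hidden)]
    # Scale the output weights by 1/sqrt(n_hidden) so the output variance
    # stays bounded as the hidden layer grows.
    out_sd = weight_sd / math.sqrt(n_hidden)
    out_w = [rng.gauss(0.0, out_sd) for _ in range(n_hidden)]
    return [sum(v * math.tanh(a * x + b) for v, a, b in zip(out_w, in_w, in_b))
            for x in xs]


rng = random.Random(0)
xs = [-1.0, 0.0, 1.0]
draws = [sample_network_function(xs, rng=rng) for _ in range(2000)]

# The weight prior is symmetric about zero, so the induced prior over
# functions has zero mean at every input.
mean_at_zero = sum(d[1] for d in draws) / len(draws)
print(mean_at_zero)
```

Collecting many such draws gives an empirical picture of the function-space prior: the sample mean at any input should be near zero, and the sample covariance between two inputs estimates the induced covariance function.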
P. W. Goldberg, C. K. I. Williams and C. M. Bishop

The prior can be obtained by placing prior distributions on the weights in a neural network, although we would argue that it is perhaps more natural to place priors directly over functions. One tractable way of doing this is to create a Gaussian process prior. This has the advantage that predictions can be made from the posterior using only matrix multiplication for fixed hyperparameters and a global noise level. In contrast, for neural networks (with fixed hyperparameters and a global noise level) it is necessary to use approximations or Markov chain Monte Carlo (MCMC) methods. Rasmussen (1996) has demonstrated that predictions obtained with Gaussian processes are as good as or better than other state-of-the-art predictors. In much of the work on regression problems in the statistical and neural networks literatures, it is assumed that there is a global noise level, independent of the input vector x. The book by Bishop (1995) and the papers by Bishop (1994), MacKay (1995) and Bishop and Qazaz (1997) have examined the case of input-dependent noise for parametric models such as neural networks.
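The claim that Gaussian process predictions reduce to linear algebra for fixed hyperparameters and a global noise level can be sketched as follows. The example assumes a squared-exponential covariance function with fixed length scale and signal variance, and a single noise variance `noise_var`; these choices, the function names, and the data are hypothetical. The predictive mean at a test input x* is k(x*)^T (K + noise_var I)^{-1} y: one linear solve against the training targets followed by a matrix-vector product.

```python
import math


def rbf(a, b, length_scale=1.0, signal_var=1.0):
    """Squared-exponential covariance between two scalar inputs."""
    return signal_var * math.exp(-0.5 * ((a - b) / length_scale) ** 2)


def cholesky(A):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def solve_cholesky(L, b):
    """Solve (L L^T) x = b by forward and back substitution."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


def gp_predict_mean(x_train, y_train, x_test, noise_var=0.01):
    """Posterior predictive mean of a zero-mean GP with global noise variance."""
    K = [[rbf(a, b) + (noise_var if i == j else 0.0)
          for j, b in enumerate(x_train)] for i, a in enumerate(x_train)]
    alpha = solve_cholesky(cholesky(K), y_train)  # (K + noise_var I)^{-1} y
    return [sum(rbf(xs, xt) * a for xt, a in zip(x_train, alpha))
            for xs in x_test]


# Hypothetical training data sampled from sin(x).
x_train = [-2.0, -1.0, 0.0, 1.0, 2.0]
y_train = [math.sin(x) for x in x_train]
print(gp_predict_mean(x_train, y_train, [0.5, 1.5]))
```

An input-dependent noise model would replace the constant `noise_var` on the diagonal with a value that varies from one training input to the next, which is what breaks this simple closed form once the noise function itself must be inferred.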
We investigate Bayesian alternatives to classical Monte Carlo methods for evaluating integrals. Bayesian Monte Carlo (BMC) allows the incorporation of prior knowledge, such as smoothness of the integrand, into the estimation. In a simple problem we show that this outperforms any classical importance sampling method. We also attempt more challenging multidimensional integrals involved in computing marginal likelihoods of statistical models (a.k.a.
M. Opper and O. Winther

We discuss the application of TAP mean field methods known from the Statistical Mechanics of disordered systems to Bayesian classification models with Gaussian processes. In contrast to previous approaches, no knowledge about the distribution of inputs is needed. Simulation results for the Sonar data set are given.

Gaussian process models have recently been introduced into the Neural Computation community (Neal 1996, Williams & Rasmussen 1996, MacKay 1997). If we assume fields with zero prior mean, the statistics of $h$ are entirely defined by the second-order correlations $C(s, s') = E[h(s)h(s')]$, where $E$ denotes expectations with respect to the prior.
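The role of the covariance function can be checked numerically with a short sketch. It assumes a squared-exponential form for C(s, s'), which is a hypothetical choice because the text does not specify one, draws the pair (h(s), h(s')) from the corresponding zero-mean Gaussian prior, and compares the sample average of h(s)h(s') with C(s, s').

```python
import math
import random


def covariance(s, s_prime, length_scale=1.0):
    """Hypothetical squared-exponential choice for C(s, s')."""
    return math.exp(-0.5 * ((s - s_prime) / length_scale) ** 2)


def sample_field_pair(s, s_prime, rng):
    """Draw (h(s), h(s')) from a zero-mean Gaussian prior with covariance C."""
    c11 = covariance(s, s)
    c12 = covariance(s, s_prime)
    c22 = covariance(s_prime, s_prime)
    # Cholesky factor of the 2x2 covariance matrix [[c11, c12], [c12, c22]].
    l11 = math.sqrt(c11)
    l21 = c12 / l11
    l22 = math.sqrt(max(0.0, c22 - l21 ** 2))  # guard against rounding below zero
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    return l11 * z1, l21 * z1 + l22 * z2


rng = random.Random(0)
s, s_prime = 0.0, 0.5
pairs = [sample_field_pair(s, s_prime, rng) for _ in range(20000)]

# Monte Carlo estimate of E[h(s) h(s')] under the prior.
empirical = sum(a * b for a, b in pairs) / len(pairs)
print(empirical, covariance(s, s_prime))
```

The two printed numbers should agree up to Monte Carlo error, illustrating that a zero-mean Gaussian field carries no information beyond its second-order correlations.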