 Directed Networks


Robust Full Bayesian Methods for Neural Networks

Neural Information Processing Systems

In particular, MacKay showed that by approximating the distributions of the weights with Gaussians and adopting smoothing priors, it is possible to obtain estimates of the weights and output variances and to automatically set the regularisation coefficients. Neal (1996) cast the net much further by introducing advanced Bayesian simulation methods, specifically the hybrid Monte Carlo method, into the analysis of neural networks [3]. Bayesian sequential Monte Carlo methods have also been shown to provide good training results, especially in time-varying scenarios [4]. More recently, Rios Insua and Muller (1998) and Holmes and Mallick (1998) have addressed the issue of selecting the number of hidden neurons with growing and pruning algorithms from a Bayesian perspective [5,6]. In particular, they apply the reversible jump Markov Chain Monte Carlo (MCMC) algorithm of Green [7] to feed-forward sigmoidal networks and radial basis function (RBF) networks to obtain joint estimates of the number of neurons and weights. We also apply the reversible jump MCMC simulation algorithm to RBF networks so as to compute the joint posterior distribution of the radial basis parameters and the number of basis functions. However, we advance this area of research in two important directions. Firstly, we propose a full hierarchical prior for RBF networks.


Mixture Density Estimation

Neural Information Processing Systems

Gaussian mixtures (or so-called radial basis function networks) for density estimation provide a natural counterpart to sigmoidal neural networks for function fitting and approximation. In both cases, it is possible to give simple expressions for the iterative improvement of performance as components of the network are introduced one at a time. In particular, for mixture density estimation we show that a k-component mixture estimated by maximum likelihood (or by an iterative likelihood improvement that we introduce) achieves log-likelihood within order 1/k of the log-likelihood achievable by any convex combination. Consequences for approximation and estimation using Kullback-Leibler risk are also given. A Minimum Description Length principle selects the optimal number of components k that minimizes the risk bound.

1 Introduction

In density estimation, Gaussian mixtures provide flexible-basis representations for densities that can be used to model heterogeneous data in high dimensions. Consider a parametric family $G = \{ p_\theta(x),\ x \in \mathcal{X} \subset \mathbb{R}^{d'} : \theta \in \dots$ The main theme of the paper is to give approximation and estimation bounds of arbitrary densities by finite mixture densities.
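The idea of introducing mixture components one at a time can be illustrated with a small sketch. This is not the authors' algorithm: it assumes unit-variance Gaussian components, a fixed grid of candidate means, and a grid search over the mixing weight `alpha`, all of which are illustrative choices. The function name `greedy_mixture` and the toy data are hypothetical.

```python
import math

def normal_pdf(x, mean, sd=1.0):
    """Density of a univariate Gaussian."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def log_likelihood(density_values):
    """Sum of log densities over the data points."""
    return sum(math.log(v) for v in density_values)

def greedy_mixture(data, candidate_means, k, alphas):
    """Build a k-component mixture by adding one component per step.

    At each step the new mixture is (1 - alpha) * current + alpha * new_component,
    with alpha and the new mean chosen by grid search to maximize log-likelihood.
    (Assumption: unit variances and grid search, for illustration only.)
    """
    # Step 1: the single component with the highest likelihood.
    first = max(candidate_means,
                key=lambda m: log_likelihood([normal_pdf(x, m) for x in data]))
    components = [(1.0, first)]                      # (weight, mean) pairs
    current = [normal_pdf(x, first) for x in data]   # mixture density at each point
    history = [log_likelihood(current)]

    for _ in range(1, k):
        best = None
        for m in candidate_means:
            new = [normal_pdf(x, m) for x in data]
            for a in alphas:
                mixed = [(1 - a) * c + a * n for c, n in zip(current, new)]
                ll = log_likelihood(mixed)
                if best is None or ll > best[0]:
                    best = (ll, a, m, mixed)
        ll, a, m, current = best
        # Shrink the old weights by (1 - a) and append the new component.
        components = [(w * (1 - a), mu) for w, mu in components] + [(a, m)]
        history.append(ll)
    return components, history

data = [-2.2, -2.0, -1.8, 1.8, 2.0, 2.2]          # toy bimodal sample
means = [i * 0.5 for i in range(-6, 7)]           # candidate means in [-3, 3]
alphas = [i / 10 for i in range(1, 10)]           # candidate mixing weights
components, history = greedy_mixture(data, means, k=3, alphas=alphas)
```

On this toy sample the log-likelihood in `history` improves as components are added, which is the behaviour the abstract quantifies with its order 1/k bound.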


A Variational Bayesian Framework for Graphical Models

Neural Information Processing Systems

This paper presents a novel practical framework for Bayesian model averaging and model selection in probabilistic graphical models. Our approach approximates full posterior distributions over model parameters and structures, as well as latent variables, in an analytical manner. These posteriors fall out of a free-form optimization procedure, which naturally incorporates conjugate priors. Unlike in large sample approximations, the posteriors are generally non-Gaussian and no Hessian needs to be computed. Predictive quantities are obtained analytically. The resulting algorithm generalizes the standard Expectation Maximization algorithm, and its convergence is guaranteed. We demonstrate that this approach can be applied to a large class of models in several domains, including mixture models and source separation.


Population Decoding Based on an Unfaithful Model

Neural Information Processing Systems

We study a population decoding paradigm in which the maximum likelihood inference is based on an unfaithful decoding model (UMLI). This is usually the case for neural population decoding because the encoding process of the brain is not exactly known, or because a simplified decoding model is preferred for saving computational cost. We consider an unfaithful decoding model which neglects the pairwise correlation between neuronal activities, and prove that UMLI is asymptotically efficient when the neuronal correlation is uniform or of limited range. The performance of UMLI is compared with that of the maximum likelihood inference based on a faithful model and that of the center of mass decoding method. It turns out that UMLI has the advantage of markedly reducing the computational complexity while maintaining a high level of decoding accuracy. The effect of correlation on the decoding accuracy is also discussed.
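The two simpler decoders mentioned above can be contrasted in a short sketch. It assumes Gaussian tuning curves, equal-variance Gaussian noise, and a grid search over candidate stimuli; under those assumptions, maximum likelihood with correlations ignored reduces to least squares. The function names, tuning width, and grid are hypothetical and not taken from the paper.

```python
import math

def tuning(x, center, width=1.0):
    """Gaussian tuning curve: mean response of a neuron preferring `center`."""
    return math.exp(-0.5 * ((x - center) / width) ** 2)

def decode_independent_ml(responses, centers, grid):
    """ML decoding under a model that ignores correlations between neurons.

    Assumption: independent Gaussian noise with equal variance, so the
    log-likelihood is maximized by minimizing the squared error.
    """
    def squared_error(x):
        return sum((r - tuning(x, c)) ** 2 for r, c in zip(responses, centers))
    return min(grid, key=squared_error)

def decode_center_of_mass(responses, centers):
    """Center-of-mass decoding: response-weighted average of preferred stimuli."""
    return sum(r * c for r, c in zip(responses, centers)) / sum(responses)

centers = [float(c) for c in range(-5, 6)]     # preferred stimuli of 11 neurons
grid = [i / 100 for i in range(-300, 301)]     # candidate stimulus values
true_x = 0.3
responses = [tuning(true_x, c) for c in centers]   # noiseless responses for clarity

x_ml = decode_independent_ml(responses, centers, grid)
x_com = decode_center_of_mass(responses, centers)
```

With noiseless responses both decoders land close to `true_x`; the differences the abstract discusses arise once correlated noise is added to `responses`.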


Rules and Similarity in Concept Learning

Neural Information Processing Systems

This paper argues that two apparently distinct modes of generalizing concepts - abstracting rules and computing similarity to exemplars - should both be seen as special cases of a more general Bayesian learning framework. Bayes explains the specific workings of these two modes - which rules are abstracted, how similarity is measured - as well as why generalization should appear rule- or similarity-based in different situations. This analysis also suggests why the rules/similarity distinction, even if not computationally fundamental, may still be useful at the algorithmic level as part of a principled approximation to fully Bayesian learning.


Bayesian Modelling of fMRI Time Series

Neural Information Processing Systems

We present a Hidden Markov Model (HMM) for inferring the hidden psychological state (or neural activity) during single trial fMRI activation experiments with blocked task paradigms. Inference is based on Bayesian methodology, using a combination of analytical and a variety of Markov Chain Monte Carlo (MCMC) sampling techniques. The advantage of this method is that detection of short time learning effects between repeated trials is possible since inference is based only on single trial experiments.


Greedy Importance Sampling

Neural Information Processing Systems

I present a simple variation of importance sampling that explicitly searches for important regions in the target distribution. I prove that the technique yields unbiased estimates, and show empirically that it can reduce the variance of standard Monte Carlo estimators. This is achieved by concentrating samples in more significant regions of the sample space.

1 Introduction

It is well known that general inference and learning with graphical models is computationally hard [1], and it is therefore necessary to consider restricted architectures [13] or approximate algorithms to perform these tasks [3, 7]. Among the most convenient and successful techniques are stochastic methods, which are guaranteed to converge to a correct solution in the limit of large samples [10, 11, 12, 15]. These methods can be easily applied to complex inference problems that overwhelm deterministic approaches.
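For orientation, here is a minimal sketch of standard importance sampling, the baseline that the abstract modifies. It does not implement the greedy search variant. The target (a standard normal), the proposal (a zero-mean normal with standard deviation 2), and the function name `importance_estimate` are all assumptions chosen for illustration.

```python
import math
import random

def normal_pdf(x, mean, sd):
    """Density of a univariate Gaussian."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def importance_estimate(f, n, rng, proposal_sd=2.0):
    """Estimate E_p[f(x)] for p = N(0, 1) using samples from q = N(0, proposal_sd^2).

    Each sample is reweighted by w = p(x) / q(x), which keeps the estimator
    unbiased as long as q is positive wherever p * f is nonzero.
    """
    total = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, proposal_sd)                        # draw from the proposal
        w = normal_pdf(x, 0.0, 1.0) / normal_pdf(x, 0.0, proposal_sd)
        total += w * f(x)
    return total / n

rng = random.Random(0)                                         # fixed seed for repeatability
estimate = importance_estimate(lambda x: x * x, 20000, rng)    # true value is 1.0
```

The variance of this estimator depends on how well the proposal covers the regions where the target and integrand are large, which is the issue the greedy variant is designed to address.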


The Infinite Gaussian Mixture Model

Neural Information Processing Systems

In a Bayesian mixture model it is not necessary a priori to limit the number of components to be finite. In this paper an infinite Gaussian mixture model is presented which neatly sidesteps the difficult problem of finding the "right" number of mixture components. Inference in the model is done using an efficient parameter-free Markov Chain that relies entirely on Gibbs sampling.
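One way to see how a mixture can leave the number of components open is to sample component assignments sequentially, letting each new point either join an existing component or start a new one. The sketch below draws from such a prior (often called the Chinese restaurant process); it is only the prior over assignments, not the paper's Gibbs sampler, and the concentration value `alpha` and the function name `crp_assignments` are hypothetical.

```python
import random

def crp_assignments(n, alpha, rng):
    """Sequentially assign n points to components without fixing their number.

    Point i (0-based) joins existing component j with probability
    counts[j] / (i + alpha) and opens a new component with probability
    alpha / (i + alpha).
    """
    counts = []   # number of points in each component so far
    labels = []   # component index chosen for each point
    for i in range(n):
        u = rng.random() * (i + alpha)
        cumulative = 0.0
        chosen = len(counts)              # default: open a new component
        for j, c in enumerate(counts):
            cumulative += c
            if u < cumulative:
                chosen = j
                break
        if chosen == len(counts):
            counts.append(0)
        counts[chosen] += 1
        labels.append(chosen)
    return labels, counts

rng = random.Random(1)
labels, counts = crp_assignments(50, alpha=1.0, rng=rng)
```

The number of occupied components, `len(counts)`, is random and grows slowly with `n`, so no upper limit has to be chosen in advance.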


Modeling High-Dimensional Discrete Data with Multi-Layer Neural Networks

Neural Information Processing Systems

The curse of dimensionality is severe when modeling high-dimensional discrete data: the number of possible combinations of the variables explodes exponentially. In this paper we propose a new architecture for modeling high-dimensional data that requires resources (parameters and computations) that grow only at most as the square of the number of variables, using a multi-layer neural network to represent the joint distribution of the variables as the product of conditional distributions. The neural network can be interpreted as a graphical model without hidden random variables, but in which the conditional distributions are tied through the hidden units. The connectivity of the neural network can be pruned by using dependency tests between the variables. Experiments on modeling the distribution of several discrete data sets show statistically significant improvements over other methods such as naive Bayes and comparable Bayesian networks, and show that significant improvements can be obtained by pruning the network.

1 Introduction

The curse of dimensionality hits particularly hard on models of high-dimensional discrete data because there are many more possible combinations of the values of the variables than can possibly be observed in any data set, even the large data sets now common in data-mining applications. In this paper we are dealing in particular with multivariate discrete data, where one tries to build a model of the distribution of the data. This can be used for example to detect anomalous cases in data-mining applications, or it can be used to model the class-conditional distribution of some observed variables in order to build a classifier. A simple multinomial maximum likelihood model would give zero probability to all of the combinations not encountered in the training set, i.e., it would most likely give zero probability to most out-of-sample test cases. Smoothing the model by assigning the same nonzero probability for all the unobserved cases would not be satisfactory either because it would not provide much generalization from the training set. This could be obtained by using a multivariate multinomial model whose parameters θ are estimated by the maximum a-posteriori (MAP) principle, i.e., those that have the greatest probability, given the training data D, and using a diffuse prior P(θ) (e.g.
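The contrast between the maximum likelihood and MAP multinomial estimates can be made concrete with a small sketch. It assumes a symmetric Dirichlet prior with parameter `alpha` as the diffuse prior, which is an illustrative choice since the source text is cut off before naming one; the function names and the toy sample are hypothetical.

```python
from collections import Counter
from itertools import product

def ml_estimate(samples, outcomes):
    """Maximum likelihood multinomial: relative frequency of each combination."""
    counts = Counter(samples)
    n = len(samples)
    return {o: counts[o] / n for o in outcomes}

def map_estimate(samples, outcomes, alpha=2.0):
    """MAP multinomial under a symmetric Dirichlet(alpha) prior (alpha >= 1).

    The posterior mode is (count + alpha - 1) / (N + K * (alpha - 1)),
    where K is the number of possible combinations.
    """
    counts = Counter(samples)
    n = len(samples)
    k = len(outcomes)
    return {o: (counts[o] + alpha - 1.0) / (n + k * (alpha - 1.0)) for o in outcomes}

outcomes = list(product([0, 1], repeat=3))              # all 8 combinations of 3 binary variables
samples = [(0, 0, 0), (0, 0, 0), (1, 1, 1), (0, 1, 1)]  # toy training set
p_ml = ml_estimate(samples, outcomes)
p_map = map_estimate(samples, outcomes)
```

Here `p_ml[(1, 0, 0)]` is zero because that combination never occurs in `samples`, while `p_map[(1, 0, 0)]` is 1/12. Every unseen combination receives that same value, which is the lack of generalization the passage points out.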