Bayesian Learning
The Infinite Gaussian Mixture Model
In a Bayesian mixture model it is not necessary a priori to limit the number of components to be finite. In this paper an infinite Gaussian mixture model is presented which neatly sidesteps the difficult problem of finding the "right" number of mixture components. Inference in the model is done using an efficient parameter-free Markov Chain that relies entirely on Gibbs sampling.
Algorithms for Independent Components Analysis and Higher Order Statistics
Lee, Daniel D., Rokni, Uri, Sompolinsky, Haim
A latent variable generative model with finite noise is used to describe several different algorithms for Independent Components Analysis (lCA). In particular, the Fixed Point ICA algorithm is shown to be equivalent to the Expectation-Maximization algorithm for maximum likelihood under certain constraints, allowing the conditions for global convergence to be elucidated. The algorithms can also be explained by their generic behavior near a singular point where the size of the optimal generative bases vanishes. An expansion of the likelihood about this singular point indicates the role of higher order correlations in determining the features discovered by ICA. The application and convergence of these algorithms are demonstrated on a simple illustrative example.
Maximum Entropy Discrimination
Jaakkola, Tommi, Meila, Marina, Jebara, Tony
We present a general framework for discriminative estimation based on the maximum entropy principle and its extensions. All calculations involve distributions over structures and/or parameters rather than specific settings and reduce to relative entropy projections. This holds even when the data is not separable within the chosen parametric class, in the context of anomaly detection rather than classification, or when the labels in the training set are uncertain or incomplete. Support vector machines are naturally subsumed under this class and we provide several extensions. We are also able to estimate exactly and efficiently discriminative distributions over tree structures of class-conditional models within this framework.
Variational Inference for Bayesian Mixtures of Factor Analysers
Ghahramani, Zoubin, Beal, Matthew J.
Zoubin Ghahramani and Matthew J. Beal Gatsby Computational Neuroscience Unit University College London 17 Queen Square, London WC1N 3AR, England {zoubin,m.beal}Ggatsby.ucl.ac.uk Abstract We present an algorithm that infers the model structure of a mixture of factor analysers using an efficient and deterministic variational approximation to full Bayesian integration over model parameters. This procedure can automatically determine the optimal number of components and the local dimensionality of each component (Le. the number of factors in each factor analyser). Alternatively it can be used to infer posterior distributions over number of components and dimensionalities. Since all parameters are integrated out the method is not prone to overfitting. Using a stochastic procedure for adding components it is possible to perform the variational optimisation incrementally and to avoid local maxima.
The Nonnegative Boltzmann Machine
Downs, Oliver B., MacKay, David J. C., Lee, Daniel D.
The nonnegative Boltzmann machine (NNBM) is a recurrent neural network model that can describe multimodal nonnegative data. Application of maximum likelihood estimation to this model gives a learning rule that is analogous to the binary Boltzmann machine. We examine the utility of the mean field approximation for the NNBM, and describe how Monte Carlo sampling techniques can be used to learn its parameters. Reflective slice sampling is particularly well-suited for this distribution, and can efficiently be implemented to sample the distribution. We illustrate learning of the NNBM on a transiationally invariant distribution, as well as on a generative model for images of human faces. Introduction The multivariate Gaussian is the most elementary distribution used to model generic data. It represents the maximum entropy distribution under the constraint that the mean and covariance matrix of the distribution match that of the data. For the case of binary data, the maximum entropy distribution that matches the first and second order statistics of the data is given by the Boltzmann machine [1].
Robust Neural Network Regression for Offline and Online Learning
Briegel, Thomas, Tresp, Volker
Although one can derive the Gaussian noise assumption based on a maximum entropy approach, the main reason for this assumption is practicability: under the Gaussian noise assumption the maximum likelihood parameter estimate can simply be found by minimization of the squared error. Despite its common use it is far from clear that the Gaussian noise assumption is a good choice for many practical problems. A reasonable approach therefore would be a noise distribution which contains the Gaussian as a special case but which has a tunable parameter that allows for more flexible distributions.
Modeling High-Dimensional Discrete Data with Multi-Layer Neural Networks
The curse of dimensionality is severe when modeling high-dimensional discrete data: the number of possible combinations of the variables explodes exponentially. In this paper we propose a new architecture for modeling high-dimensional data that requires resources (parameters and computations) that grow only at most as the square of the number of variables, using a multi-layer neural network to represent the joint distribution of the variables as the product of conditional distributions. The neural network can be interpreted as a graphical model without hidden random variables, but in which the conditional distributions are tied through the hidden units. The connectivity of the neural network can be pruned by using dependency tests between the variables. Experiments on modeling the distribution of several discrete data sets show statistically significant improvements over other methods such as naive Bayes and comparable Bayesian networks, and show that significant improvements can be obtained by pruning the network. 1 Introduction The curse of dimensionality hits particularly hard on models of high-dimensional discrete data because there are many more possible combinations of the values of the variables than can possibly be observed in any data set, even the large data sets now common in datamining applications.
Independent Factor Analysis with Temporally Structured Sources
We present a new technique for time series analysis based on dynamic probabilistic networks. In this approach, the observed data are modeled in terms of unobserved, mutually independent factors, as in the recently introduced technique of Independent Factor Analysis (IFA). However, unlike in IFA, the factors are not Li.d.; each factor has its own temporal statistical characteristics. We derive a family of EM algorithms that learn the structure of the underlying factors and their relation to the data. These algorithms perform source separation and noise reduction in an integrated manner, and demonstrate superior performance compared to IFA. 1 Introduction The technique of independent factor analysis (IFA) introduced in [1] provides a tool for modeling L'-dim data in terms of L unobserved factors. These factors are mutually independent and combine linearly with added noise to produce the observed data.
Mixture Density Estimation
Li, Jonathan Q., Barron, Andrew R.
Gaussian mixtures (or so-called radial basis function networks) for density estimation provide a natural counterpart to sigmoidal neural networks for function fitting and approximation. In both cases, it is possible to give simple expressions for the iterative improvement of performance as components of the network are introduced one at a time. In particular, for mixture density estimation we show that a k-component mixture estimated by maximum likelihood (or by an iterative likelihood improvement that we introduce) achieves log-likelihood within order 1/k of the log-likelihood achievable by any convex combination. Consequences for approximation and estimation using Kullback-Leibler risk are also given. A Minimum Description Length principle selects the optimal number of components k that minimizes the risk bound. 1 Introduction In density estimation, Gaussian mixtures provide flexible-basis representations for densities that can be used to model heterogeneous data in high dimensions. Consider a parametric family G { pe(x), x E X C Rd': fJ E The main theme of the paper is to give approximation and estimation bounds of arbitrary densities by finite mixture densities.