Goto

Collaborating Authors

 Bayesian Inference


Bayesian time series classification

Neural Information Processing Systems

This paper proposes an approach to classification of adjacent segments of a time series as being either of classes. We use a hierarchical model that consists of a feature extraction stage and a generative classifier which is built on top of these features. Such two stage approaches are often used in signal and image processing. The novel part of our work is that we link these stages probabilistically by using a latent feature space. To use one joint model is a Bayesian requirement, which has the advantage to fuse information according to its certainty.


Probabilistic Abstraction Hierarchies

Neural Information Processing Systems

Many domains are naturally organized in an abstraction hierarchy or taxonomy, where the instances in "nearby" classes in the taxonomy are similar. In this paper, weprovide a general probabilistic framework for clustering data into a set of classes organized as a taxonomy, where each class is associated with a probabilistic modelfrom which the data was generated. The clustering algorithm simultaneously optimizes three things: the assignment of data instances to clusters, themodels associated with the clusters, and the structure of the abstraction hierarchy. A unique feature of our approach is that it utilizes global optimization algorithms for both of the last two steps, reducing the sensitivity to noise and the propensity to local maxima that are characteristic of algorithms such as hierarchical agglomerativeclustering that only take local steps. We provide a theoretical analysis for our algorithm, showing that it converges to a local maximum of the joint likelihood of model and data.


Multiplicative Updates for Classification by Mixture Models

Neural Information Processing Systems

We investigate a learning algorithm for the classification of nonnegative data by mixture models. Multiplicative update rules are derived that directly optimize the performance of these models as classifiers. The update rules have a simple closed form and an intuitive appeal. Our algorithm retains the main virtues of the Expectation-Maximization (EM) algorithm--its guarantee of monotonic improvement, andits absence of tuning parameters--with the added advantage of optimizing a discriminative objective function. The algorithm reduces as a special caseto the method of generalized iterative scaling for log-linear models. The learning rate of the algorithm is controlled by the sparseness of the training data. We use the method of nonnegative matrix factorization (NMF) to discover sparse distributed representations of the data. This form of feature selection greatly accelerates learning and makes the algorithm practical on large problems. Experiments showthat discriminatively trained mixture models lead to much better classification than comparably sized models trained by EM.


Global Coordination of Local Linear Models

Neural Information Processing Systems

High dimensional data that lies on or near a low dimensional manifold can be described bya collection of local linear models. Such a description, however, does not provide a global parameterization of the manifold--arguably an important goal of unsupervised learning. In this paper, we show how to learn a collection of local linear models that solves this more difficult problem. Our local linear models are represented by a mixture of factor analyzers, and the "global coordination" ofthese models is achieved by adding a regularizing term to the standard maximum likelihood objective function. The regularizer breaks a degeneracy in the mixture model's parameter space, favoring models whose internal coordinate systemsare aligned in a consistent way. As a result, the internal coordinates changesmoothly and continuously as one traverses a connected path on the manifold--even when the path crosses the domains of many different local models. The regularizer takes the form of a Kullback-Leibler divergence and illustrates an unexpected application of variational methods: not to perform approximate inferencein intractable probabilistic models, but to learn more useful internal representations in tractable ones.


Infinite Mixtures of Gaussian Process Experts

Neural Information Processing Systems

We present an extension to the Mixture of Experts (ME) model, where the individual experts are Gaussian Process (GP) regression models. Using aninput-dependent adaptation of the Dirichlet Process, we implement agating network for an infinite number of Experts. Inference in this model may be done efficiently using a Markov Chain relying on Gibbs sampling. The model allows the effective covariance function to vary with the inputs, and may handle large datasets - thus potentially overcoming twoof the biggest hurdles with GP models.



Rao-Blackwellised Particle Filtering via Data Augmentation

Neural Information Processing Systems

SMC is often referred to as particle filtering (PF) in the context of computing filtering distributions for statistical inference and learning. It is known that the performance of PF often deteriorates in high-dimensional state spaces. In the past, we have shown that if a model admits partial analytical tractability, it is possible to combine PF with exact algorithms (Kalman filters, HMM filters, junction tree algorithm) to obtain efficient high dimensional filters (Doucet, de Freitas, Murphy and Russell 2000, Doucet, Godsill and Andrieu 2000). In particular, we exploited a marginalisation technique known as Rao-Blackwellisation (RB). Here, we attack a more complex model that does not admit immediate analytical tractability. This probabilistic model consists of Gaussian latent variables and binary observations.We show that by augmenting the model with artificial variables, it becomes possible to apply Rao-Blackwellisation and optimal sampling strategies. We focus on the problem of sequential binary classification (that is, when the data arrives one-at-a-time) using generic classifiers that consist of linear combinations of basis functions, whose coefficients evolve according to a Gaussian smoothness prior (Kitagawa and Gersch 1996). We have previously addressed this problem in the context of sequential fault detection in marine diesel engines (H0jen-S0rensen, de Freitas and Fog 2000). This application is of great importance as early detection of incipient faults can improve safety and efficiency, as well as, help to reduce downtime andplant maintenance in many industrial and transportation environments.



Boosting and Maximum Likelihood for Exponential Models

Neural Information Processing Systems

We derive an equivalence between AdaBoost and the dual of a convex optimization problem, showing that the only difference between minimizing theexponential loss used by AdaBoost and maximum likelihood for exponential models is that the latter requires the model to be normalized toform a conditional probability distribution over labels. In addition to establishing a simple and easily understood connection between the two methods, this framework enables us to derive new regularization procedures for boosting that directly correspond to penalized maximum likelihood. Experiments on UCI datasets support our theoretical analysis andgive additional insight into the relationship between boosting and logistic regression.