Goto

Collaborating Authors

 Directed Networks


Global Coordination of Local Linear Models

Neural Information Processing Systems

High dimensional data that lies on or near a low dimensional manifold can be described bya collection of local linear models. Such a description, however, does not provide a global parameterization of the manifold--arguably an important goal of unsupervised learning. In this paper, we show how to learn a collection of local linear models that solves this more difficult problem. Our local linear models are represented by a mixture of factor analyzers, and the "global coordination" ofthese models is achieved by adding a regularizing term to the standard maximum likelihood objective function. The regularizer breaks a degeneracy in the mixture model's parameter space, favoring models whose internal coordinate systemsare aligned in a consistent way. As a result, the internal coordinates changesmoothly and continuously as one traverses a connected path on the manifold--even when the path crosses the domains of many different local models. The regularizer takes the form of a Kullback-Leibler divergence and illustrates an unexpected application of variational methods: not to perform approximate inferencein intractable probabilistic models, but to learn more useful internal representations in tractable ones.


Infinite Mixtures of Gaussian Process Experts

Neural Information Processing Systems

We present an extension to the Mixture of Experts (ME) model, where the individual experts are Gaussian Process (GP) regression models. Using aninput-dependent adaptation of the Dirichlet Process, we implement agating network for an infinite number of Experts. Inference in this model may be done efficiently using a Markov Chain relying on Gibbs sampling. The model allows the effective covariance function to vary with the inputs, and may handle large datasets - thus potentially overcoming twoof the biggest hurdles with GP models.



On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes

Neural Information Processing Systems

Discriminative classifiers model the posterior p(ylx)directly, or learn a direct map from inputs x to the class labels. There are several compelling reasons for using discriminative rather than generative classifiers, oneof which, succinctly articulated by Vapnik [6], is that "one should solve the [classification] problem directly and never solve a more general problem as an intermediate step [such as modeling p(xly)]." Indeed, leaving aside computational issues and matters such as handling missing data, the prevailing consensus seems to be that discriminative classifiers are almost always to be preferred to generative ones. Anotherpiece of prevailing folk wisdom is that the number of examples needed to fit a model is often roughly linear in the number of free parameters of a model. This has its theoretical basis in the observation that for "many" models, the VC dimension is roughly linear or at most some low-order polynomial in the number of parameters (see, e.g., [1, 3]), and it is known that sample complexity in the discriminative setting is linear in the VC dimension [6]. In this paper, we study empirically and theoretically the extent to which these beliefs are true. A parametric family of probabilistic models p(x, y) can be fit either to optimize the joint likelihood of the inputs and the labels, or fit to optimize the conditional likelihood p(ylx), or even fit to minimize the 0-1 training error obtained by thresholding p(ylx) to make predictions.


Latent Dirichlet Allocation

Neural Information Processing Systems

We propose a generative model for text and other collections of discrete datathat generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams [6], and Hofmann's aspectmodel, also known as probabilistic latent semantic indexing (pLSI) [3]. In the context of text modeling, our model posits that each document is generated as a mixture of topics, where the continuous-valued mixture proportions are distributed as a latent Dirichlet random variable. Inference and learning are carried out efficiently via variational algorithms. We present empirical resultson applications of this model to problems in text modeling, collaborative filtering, and text classification. 1 Introduction Recent years have seen the development and successful application of several latent factor models for discrete data. One notable example, Hofmann's pLSI/aspect model [3], has received the attention of many researchers, and applications have emerged in text modeling [3], collaborative filtering [7], and link analysis [1]. In the context of text modeling, pLSI is a "bag-of-words" model in that it ignores the ordering of the words in a document.


Thin Junction Trees

Neural Information Processing Systems

We present an algorithm that induces a class of models with thin junction trees--models that are characterized by an upper bound on the size of the maximal cliques of their triangulated graph. By ensuring that the junction tree is thin, inference in our models remains tractable throughout the learning process. This allows both an efficient implementation of an iterative scaling parameter estimation algorithm and also ensures that inference can be performed efficiently with the final model. We illustrate the approach with applications in handwritten digit recognition and DNA splice site detection.


Rao-Blackwellised Particle Filtering via Data Augmentation

Neural Information Processing Systems

SMC is often referred to as particle filtering (PF) in the context of computing filtering distributions for statistical inference and learning. It is known that the performance of PF often deteriorates in high-dimensional state spaces. In the past, we have shown that if a model admits partial analytical tractability, it is possible to combine PF with exact algorithms (Kalman filters, HMM filters, junction tree algorithm) to obtain efficient high dimensional filters (Doucet, de Freitas, Murphy and Russell 2000, Doucet, Godsill and Andrieu 2000). In particular, we exploited a marginalisation technique known as Rao-Blackwellisation (RB). Here, we attack a more complex model that does not admit immediate analytical tractability. This probabilistic model consists of Gaussian latent variables and binary observations.We show that by augmenting the model with artificial variables, it becomes possible to apply Rao-Blackwellisation and optimal sampling strategies. We focus on the problem of sequential binary classification (that is, when the data arrives one-at-a-time) using generic classifiers that consist of linear combinations of basis functions, whose coefficients evolve according to a Gaussian smoothness prior (Kitagawa and Gersch 1996). We have previously addressed this problem in the context of sequential fault detection in marine diesel engines (H0jen-S0rensen, de Freitas and Fog 2000). This application is of great importance as early detection of incipient faults can improve safety and efficiency, as well as, help to reduce downtime andplant maintenance in many industrial and transportation environments.


Boosting and Maximum Likelihood for Exponential Models

Neural Information Processing Systems

We derive an equivalence between AdaBoost and the dual of a convex optimization problem, showing that the only difference between minimizing theexponential loss used by AdaBoost and maximum likelihood for exponential models is that the latter requires the model to be normalized toform a conditional probability distribution over labels. In addition to establishing a simple and easily understood connection between the two methods, this framework enables us to derive new regularization procedures for boosting that directly correspond to penalized maximum likelihood. Experiments on UCI datasets support our theoretical analysis andgive additional insight into the relationship between boosting and logistic regression.


Distribution of Mutual Information

Neural Information Processing Systems

The mutual information of two random variables z and J with joint probabilities {7rij} is commonly used in learning Bayesian nets as well as in many other fields. The chances 7rij are usually estimated by the empirical sampling frequency nij In leading to a point estimate J(nijIn) for the mutual information. To answer questions like "is J (nij In) consistent with zero?" or "what is the probability that the true mutual information is much larger than the point estimate?"