Bayesian Learning
Probabilistic Abstraction Hierarchies
Segal, Eran, Koller, Daphne, Ormoneit, Dirk
Many domains are naturally organized in an abstraction hierarchy or taxonomy, where the instances in "nearby" classes in the taxonomy are similar. In this paper, we provide a general probabilistic framework for clustering data into a set of classes organized as a taxonomy, where each class is associated with a probabilistic model from which the data was generated. The clustering algorithm simultaneously optimizes three things: the assignment of data instances to clusters, the models associated with the clusters, and the structure of the abstraction hierarchy. A unique feature of our approach is that it utilizes global optimization algorithms for both of the last two steps, reducing the sensitivity to noise and the propensity to local maxima that are characteristic of algorithms such as hierarchical agglomerative clustering that only take local steps. We provide a theoretical analysis for our algorithm, showing that it converges to a local maximum of the joint likelihood of model and data.
Global Coordination of Local Linear Models
Roweis, Sam T., Saul, Lawrence K., Hinton, Geoffrey E.
High dimensional data that lies on or near a low dimensional manifold can be described by a collection of local linear models. Such a description, however, does not provide a global parameterization of the manifold--arguably an important goal of unsupervised learning. In this paper, we show how to learn a collection of local linear models that solves this more difficult problem. Our local linear models are represented by a mixture of factor analyzers, and the "global coordination" of these models is achieved by adding a regularizing term to the standard maximum likelihood objective function. The regularizer breaks a degeneracy in the mixture model's parameter space, favoring models whose internal coordinate systems are aligned in a consistent way. As a result, the internal coordinates change smoothly and continuously as one traverses a connected path on the manifold--even when the path crosses the domains of many different local models. The regularizer takes the form of a Kullback-Leibler divergence and illustrates an unexpected application of variational methods: not to perform approximate inference in intractable probabilistic models, but to learn more useful internal representations in tractable ones.
Multiplicative Updates for Classification by Mixture Models
Saul, Lawrence K., Lee, Daniel D.
We investigate a learning algorithm for the classification of nonnegative data by mixture models. Multiplicative update rules are derived that directly optimize the performance of these models as classifiers. The update rules have a simple closed form and an intuitive appeal. Our algorithm retains the main virtues of the Expectation-Maximization (EM) algorithm--its guarantee of monotonic improvement, and its absence of tuning parameters--with the added advantage of optimizing a discriminative objective function. The algorithm reduces as a special case to the method of generalized iterative scaling for log-linear models. The learning rate of the algorithm is controlled by the sparseness of the training data. We use the method of nonnegative matrix factorization (NMF) to discover sparse distributed representations of the data. This form of feature selection greatly accelerates learning and makes the algorithm practical on large problems. Experiments show that discriminatively trained mixture models lead to much better classification than comparably sized models trained by EM.
Infinite Mixtures of Gaussian Process Experts
Rasmussen, Carl E., Ghahramani, Zoubin
We present an extension to the Mixture of Experts (ME) model, where the individual experts are Gaussian Process (GP) regression models. Using an input-dependent adaptation of the Dirichlet Process, we implement a gating network for an infinite number of Experts. Inference in this model may be done efficiently using a Markov Chain relying on Gibbs sampling. The model allows the effective covariance function to vary with the inputs, and may handle large datasets - thus potentially overcoming two of the biggest hurdles with GP models.
MIME: Mutual Information Minimization and Entropy Maximization for Bayesian Belief Propagation
Rangarajan, Anand, Yuille, Alan L.
Bayesian belief propagation in graphical models has been recently shown to have very close ties to inference methods based in statistical physics. After Yedidia et al. demonstrated that belief propagation fixed points correspond to extrema of the so-called Bethe free energy, Yuille derived a double loop algorithm that is guaranteed to converge to a local minimum of the Bethe free energy. Yuille's algorithm is based on a certain decomposition of the Bethe free energy and he mentions that other decompositions are possible and may even be fruitful. In the present work, we begin with the Bethe free energy and show that it has a principled interpretation as pairwise mutual information minimization and marginal entropy maximization (MIME). Next, we construct a family of free energy functions from a spectrum of decompositions of the original Bethe free energy. For each free energy in this family, we develop a new algorithm that is guaranteed to converge to a local minimum. Preliminary computer simulations are in agreement with this theoretical development.
On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes
Ng, Andrew Y., Jordan, Michael I.
Discriminative classifiers model the posterior p(ylx) directly, or learn a direct map from inputs x to the class labels. There are several compelling reasons for using discriminative rather than generative classifiers, one of which, succinctly articulated by Vapnik [6], is that "one should solve the [classification] problem directly and never solve a more general problem as an intermediate step [such as modeling p(xly)]." Indeed, leaving aside computational issues and matters such as handling missing data, the prevailing consensus seems to be that discriminative classifiers are almost always to be preferred to generative ones. Another piece of prevailing folk wisdom is that the number of examples needed to fit a model is often roughly linear in the number of free parameters of a model. This has its theoretical basis in the observation that for "many" models, the VC dimension is roughly linear or at most some low-order polynomial in the number of parameters (see, e.g., [1, 3]), and it is known that sample complexity in the discriminative setting is linear in the VC dimension [6]. In this paper, we study empirically and theoretically the extent to which these beliefs are true. A parametric family of probabilistic models p(x, y) can be fit either to optimize the joint likelihood of the inputs and the labels, or fit to optimize the conditional likelihood p(ylx), or even fit to minimize the 0-1 training error obtained by thresholding p(ylx) to make predictions.
KLD-Sampling: Adaptive Particle Filters
Over the last years, particle filters have been applied with great success to a variety of state estimation problems. We present a statistical approach to increasing the efficiency of particle filters by adapting the size of sample sets on-the-fly. The key idea of the KLD-sampling method is to bound the approximation error introduced by the sample-based representation of the particle filter. The name KLD-sampling is due to the fact that we measure the approximation error by the Kullback-Leibler distance. Our adaptation approach chooses a small number of samples if the density is focused on a small part of the state space, and it chooses a large number of samples if the state uncertainty is high. Both the implementation and computation overhead of this approach are small. Extensive experiments using mobile robot localization as a test application show that our approach yields drastic improvements over particle filters with fixed sample set sizes and over a previously introduced adaptation technique.
Adaptive Sparseness Using Jeffreys Prior
In this paper we introduce a new sparseness inducing prior which does not involve any (hyper)parameters that need to be adjusted or estimated. Although other applications are possible, we focus here on supervised learning problems: regression and classification. Experiments with several publicly available benchmark data sets show that the proposed approach yields state-of-the-art performance. In particular, our method outperforms support vector machines and performs competitively with the best alternative techniques, both in terms of error rates and sparseness, although it involves no tuning or adjusting of sparsenesscontrolling hyper-parameters.
TAP Gibbs Free Energy, Belief Propagation and Sparsity
Csató, Lehel, Opper, Manfred, Winther, Ole
The adaptive TAP Gibbs free energy for a general densely connected probabilistic model with quadratic interactions and arbritary single site constraints is derived. We show how a specific sequential minimization of the free energy leads to a generalization of Minka's expectation propagation. Lastly, we derive a sparse representation version of the sequential algorithm. The usefulness of the approach is demonstrated on classification and density estimation with Gaussian processes and on an independent component analysis problem.
Latent Dirichlet Allocation
Blei, David M., Ng, Andrew Y., Jordan, Michael I.
We propose a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams [6], and Hofmann's aspect model, also known as probabilistic latent semantic indexing (pLSI) [3]. In the context of text modeling, our model posits that each document is generated as a mixture of topics, where the continuous-valued mixture proportions are distributed as a latent Dirichlet random variable. Inference and learning are carried out efficiently via variational algorithms.