Bayesian Inference
Exact MAP Estimates by (Hyper)tree Agreement
Wainwright, Martin J., Jaakkola, Tommi S., Willsky, Alan S.
We describe a method for computing provably exact maximum a posteriori (MAP) estimates for a subclass of problems on graphs with cycles. The basic idea is to represent the original problem on the graph with cycles as a convex combination of tree-structured problems. A convexity argument then guarantees that the optimal value of the original problem (i.e., the log probability of the MAP assignment) is upper bounded by the combined optimal values of the tree problems. We prove that this upper bound is met with equality if and only if the tree problems share an optimal configuration in common. An important implication is that any such shared configuration must also be the MAP configuration for the original problem. Next we develop a tree-reweighted max-product algorithm for attempting to find convex combinations of tree-structured problems that share a common optimum. We give necessary and sufficient conditions for a fixed point to yield the exact MAP estimate. An attractive feature of our analysis is that it generalizes naturally to convex combinations of hypertree-structured distributions.
VIBES: A Variational Inference Engine for Bayesian Networks
Bishop, Christopher M., Spiegelhalter, David, Winn, John
In recent years variational methods have become a popular tool for approximate inference and learning in a wide variety of probabilistic models. For each new application, however, it is currently necessary first to derive the variational update equations, and then to implement them in application-specific code. Each of these steps is both time consuming and error prone. In this paper we describe a general purpose inference engine called VIBES ('Variational Inference for Bayesian Networks') which allows a wide variety of probabilistic models to be implemented and solved variationally without recourse to coding. New models are specified either through a simple script or via a graphical interface analogous to a drawing package. VIBES then automatically generates and solves the variational equations. We illustrate the power and flexibility of VIBES using examples from Bayesian mixture modelling.
Regularized Greedy Importance Sampling
Southey, Finnegan, Schuurmans, Dale, Ghodsi, Ali
Greedy importance sampling is an unbiased estimation technique that reduces the variance of standard importance sampling by explicitly searching for modes in the estimation objective. Previous work has demonstrated the feasibility of implementing this method and proved that the technique is unbiased in both discrete and continuous domains. In this paper we present a reformulation of greedy importance sampling that eliminates the free parameters from the original estimator, and introduces a new regularization strategy that further reduces variance without compromising unbiasedness. The resulting estimator is shown to be effective for difficult estimation problems arising in Markov random field inference. In particular, improvements are achieved over standard MCMC estimators when the distribution has multiple peaked modes.
Adaptive Classification by Variational Kalman Filtering
Sykacek, Peter, Roberts, Stephen J.
We propose in this paper a probabilistic approach for adaptive inference of generalized nonlinear classification that combines the computational advantage of a parametric solution with the flexibility of sequential sampling techniques. We regard the parameters of the classifier as latent states in a first order Markov process and propose an algorithm which can be regarded as variational generalization of standard Kalman filtering. The variational Kalman filter is based on two novel lower bounds that enable us to use a non-degenerate distribution over the adaptation rate. An extensive empirical evaluation demonstrates that the proposed method is capable of infering competitive classifiers both in stationary and non-stationary environments. Although we focus on classification, the algorithm is easily extended to other generalized nonlinear models.
Dynamic Bayesian Networks with Deterministic Latent Tables
The application of latent/hidden variable Dynamic Bayesian Networks is constrained by the complexity of marginalising over latent variables. For this reason either small latent dimensions or Gaussian latent conditional tables linearly dependent on past states are typically considered in order that inference is tractable. We suggest an alternative approach in which the latent variables are modelled using deterministic conditional probability tables. This specialisation has the advantage of tractable inference even for highly complex nonlinear/non-Gaussian visible conditional probability tables. This approach enables the consideration of highly complex latent dynamics whilst retaining the benefits of a tractable probabilistic model.
On the Dirichlet Prior and Bayesian Regularization
Steck, Harald, Jaakkola, Tommi S.
In the Bayesian approach, regularization is achieved by specifying a prior distribution over the parameters and subsequently averaging over the posterior distribution. This regularization provides not only smoother estimates of the parameters compared to maximum likelihood but also guides the selection of model structures. It was pointed out in [6] that a very large scale parameter of the Dirichlet prior can degrade predictive accuracy due to severe regularization of the parameter estimates. We complement this discussion here and show that a very small scale parameter can lead to poor over-regularized structures when a product of (conjugate) Dirichlet priors is used over multinomial conditional distributions (Section 3). Section 4 demonstrates the effect of the scale parameter and how it can be calibrated. We focus on the class of Bayesian network models throughout this paper.
Independent Components Analysis through Product Density Estimation
Hastie, Trevor, Tibshirani, Rob
We present a simple direct approach for solving the ICA problem, using density estimation and maximum likelihood. Given a candidate orthogonal frame, we model each of the coordinates using a semi-parametric density estimate based on cubic splines. Since our estimates have two continuous derivatives, we can easily run a second order search for the frame parameters. Our method performs very favorably when compared to state-of-the-art techniques. 1 Introduction Independent component analysis (ICA) is a popular enhancement over principal component analysis (PCA) and factor analysis. IRP which is assumed to arise from a linear mixing of a latent random source vector S E IRP, (1) X AS; the components Sj, j 1,...,p of S are assumed to be independently distributed.
Fast Sparse Gaussian Process Methods: The Informative Vector Machine
Herbrich, Ralf, Lawrence, Neil D., Seeger, Matthias
We present a framework for sparse Gaussian process (GP) methods which uses forward selection with criteria based on informationtheoretic principles, previously suggested for active learning. Our goal is not only to learn d-sparse predictors (which can be evaluated in O(d) rather than O(n), d n, n the number of training points), but also to perform training under strong restrictions on time and memory requirements.
Bayesian Monte Carlo
Ghahramani, Zoubin, Rasmussen, Carl E.
We investigate Bayesian alternatives to classical Monte Carlo methods for evaluating integrals. Bayesian Monte Carlo (BMC) allows the incorporation of prior knowledge, such as smoothness of the integrand, into the estimation. In a simple problem we show that this outperforms any classical importance sampling method. We also attempt more challenging multidimensional integrals involved in computing marginal likelihoods of statistical models (a.k.a.
The Effect of Singularities in a Learning Machine when the True Parameters Do Not Lie on such Singularities
Watanabe, Sumio, Amari, Shun-ichi
A lot of learning machines with hidden variables used in information science have singularities in their parameter spaces. At singularities, the Fisher information matrix becomes degenerate, resulting that the learning theory of regular statistical models does not hold. Recently, it was proven that, if the true parameter is contained in singularities, then the coefficient of the Bayes generalization error is equal to the pole of the zeta function of the Kullback information.