Bayesian Inference
Functional Gaussian Process Model for Bayesian Nonparametric Analysis
Duan, Leo L., Wang, Xia, Szczesniak, Rhonda D.
Gaussian processes are theoretically appealing models for nonparametric analysis, but their computational cost hinders large-scale use, and existing reduced-rank solutions are usually heuristic. In this work, we propose a novel construction of the Gaussian process as a projection from fixed discrete frequencies to any continuous location. This leads to a valid stochastic process with theoretical support for the reduced rank in the spectral density, as well as a high-speed computing algorithm. Our method provides accurate estimates of the covariance parameters and a concise form of the predictive distribution for spatial prediction. For non-stationary data, we adopt the mixture framework with a customized spectral dependency structure. This enables clustering based on local stationarity while maintaining joint Gaussianity. Our work is directly applicable to some of the challenges in spatial data, such as large-scale computation, anisotropic covariance, and spatio-temporal modeling. We illustrate the uses of the model via simulations and an application to a massive dataset.
Parallel Predictive Entropy Search for Batch Global Optimization of Expensive Objective Functions
Shah, Amar, Ghahramani, Zoubin
We develop parallel predictive entropy search (PPES), a novel algorithm for Bayesian optimization of expensive black-box objective functions. At each iteration, PPES aims to select a batch of points which will maximize the information gain about the global maximizer of the objective. Well known strategies exist for suggesting a single evaluation point based on previous observations, while far fewer are known for selecting batches of points to evaluate in parallel. The few batch selection schemes that have been studied all resort to greedy methods to compute an optimal batch. To the best of our knowledge, PPES is the first non-greedy batch Bayesian optimization strategy. We demonstrate the benefit of this approach in optimization performance on both synthetic and real world applications, including problems in machine learning, rocket science and robotics.
Bayesian Evidence and Model Selection
Knuth, Kevin H., Habeck, Michael, Malakar, Nabin K., Mubeen, Asim M., Placek, Ben
In this paper we review the concepts of Bayesian evidence and Bayes factors, also known as log odds ratios, and their application to model selection. The theory is presented along with a discussion of analytic, approximate and numerical techniques. Specific attention is paid to the Laplace approximation, variational Bayes, importance sampling, thermodynamic integration, and nested sampling and its recent variants. Analogies to statistical physics, from which many of these techniques originate, are discussed in order to provide readers with deeper insights that may lead to new techniques. The utility of Bayesian model testing in the domain sciences is demonstrated by presenting four specific practical examples considered within the context of signal processing in the areas of signal detection, sensor characterization, scientific model selection and molecular force characterization.
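As a small illustration of the evidence and Bayes factor concepts reviewed above, the sketch below compares two models of a coin-flip sequence in closed form. The example is not taken from the paper: the two models (a fair coin versus a uniform prior on the heads probability) and the function names are assumptions chosen because both evidences have exact expressions.

```python
from math import factorial


def evidence_fair(n):
    """Evidence of a specific sequence of n flips under a fair coin (p = 0.5)."""
    return 0.5 ** n


def evidence_uniform(n, k):
    """Evidence of a specific sequence with k heads in n flips under p ~ Uniform(0, 1).

    Uses the Beta integral: int_0^1 p^k (1 - p)^(n - k) dp = k! (n - k)! / (n + 1)!
    """
    return factorial(k) * factorial(n - k) / factorial(n + 1)


def bayes_factor(n, k):
    """Bayes factor of the uniform-prior model against the fair-coin model."""
    return evidence_uniform(n, k) / evidence_fair(n)


if __name__ == "__main__":
    # Balanced data favour the simpler fair-coin model; lopsided data do not.
    print(bayes_factor(10, 5))
    print(bayes_factor(10, 10))
```

When the evidence integral has no closed form, the approximate and numerical techniques listed in the abstract take the place of `evidence_uniform`.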
Fast Parallel SAME Gibbs Sampling on General Discrete Bayesian Networks
Seita, Daniel, Chen, Haoyu, Canny, John
A fundamental task in machine learning and related fields is to perform inference on Bayesian networks. Since exact inference takes exponential time in general, a variety of approximate methods are used. Gibbs sampling is one of the most accurate approaches and provides unbiased samples from the posterior, but it has historically been too expensive for large models. In this paper, we present an optimized, parallel Gibbs sampler augmented with state replication (SAME, or State Augmented Marginal Estimation) to decrease convergence time. We find that SAME can improve the quality of parameter estimates while accelerating convergence. Experiments on both synthetic and real data show that our Gibbs sampler is substantially faster than the state-of-the-art sampler, JAGS, without sacrificing accuracy. Our ultimate objective is to introduce the Gibbs sampler to researchers in many fields to expand their range of feasible inference problems.
Factorization, Inference and Parameter Learning in Discrete AMP Chain Graphs
We address some computational issues that may hinder the use of AMP chain graphs in practice. Specifically, we show how a discrete probability distribution that satisfies all the independencies represented by an AMP chain graph factorizes according to it. We show how this factorization makes it possible to perform inference and parameter learning efficiently, by adapting existing algorithms for Markov and Bayesian networks. Finally, we turn our attention to another issue that may hinder the use of AMP CGs, namely the lack of an intuitive interpretation of their edges. We provide one such interpretation.
Stochastic Expectation Propagation
Li, Yingzhen, Hernandez-Lobato, Jose Miguel, Turner, Richard E.
Expectation propagation (EP) is a deterministic approximation algorithm that is often used to perform approximate Bayesian parameter learning. EP approximates the full intractable posterior distribution through a set of local approximations that are iteratively refined for each datapoint. EP can offer analytic and computational advantages over other approximations, such as Variational Inference (VI), and is the method of choice for a number of models. The local nature of EP appears to make it an ideal candidate for performing Bayesian learning on large models in large-scale dataset settings. However, EP has a crucial limitation in this context: the number of approximating factors needs to increase with the number of datapoints, $N$, which often entails a prohibitively large memory overhead. This paper presents an extension to EP, called stochastic expectation propagation (SEP), that maintains a global posterior approximation (like VI) but updates it in a local way (like EP). Experiments on a number of canonical learning problems using synthetic and real-world datasets indicate that SEP performs almost as well as full EP, but reduces the memory consumption by a factor of $N$. SEP is therefore ideally suited to performing approximate Bayesian learning in the large model, large dataset setting.
Vertex nomination schemes for membership prediction
Fishkind, D. E., Lyzinski, V., Pao, H., Chen, L., Priebe, C. E.
Suppose that a graph is realized from a stochastic block model where one of the blocks is of interest, but many or all of the vertices' block labels are unobserved. The task is to order the vertices with unobserved block labels into a "nomination list" such that, with high probability, vertices from the interesting block are concentrated near the list's beginning. We propose several vertex nomination schemes. Our basic--but principled--setting and development yields a best nomination scheme (which is a Bayes-Optimal analogue), and also a likelihood maximization nomination scheme that is practical to implement when there are a thousand vertices, and which is empirically near-optimal when the number of vertices is small enough to allow comparison to the best nomination scheme. We then illustrate the robustness of the likelihood maximization nomination scheme to the modeling challenges inherent in real data, using examples which include a social network involving human trafficking, the Enron Graph, a worm brain connectome and a political blog network. In a stochastic block model, the vertices of the graph are partitioned into blocks, and the existence/nonexistence of an edge between any pair of vertices is an independent Bernoulli trial, with the Bernoulli parameter being a function of the block memberships of the pair of vertices. We are concerned here with a graph realized from a stochastic block model such that many or all of the vertices' block labels are hidden (i.e., unobserved). Received August 2014; revised February 2015. Supported in part by Johns Hopkins University Human Language Technology Center of Excellence (JHU HLT COE) and the XDATA program of the Defense Advanced Research Projects Agency (DARPA) administered through Air Force Research Laboratory contract FA8750-12-2-0303.
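The stochastic block model described in this entry (vertices partitioned into blocks, each edge an independent Bernoulli trial whose parameter depends on the two block memberships) can be sketched as follows. This is a minimal illustration, not the authors' code; the function name `sample_sbm`, the block sizes, and the edge probabilities are all assumptions.

```python
import random


def sample_sbm(block_of, edge_prob, seed=0):
    """Sample an undirected simple graph from a stochastic block model.

    block_of[i] is the block label of vertex i, and edge_prob[a][b] is the
    Bernoulli parameter for an edge between a vertex in block a and a vertex
    in block b (assumed symmetric). Returns a symmetric 0/1 adjacency matrix.
    """
    rng = random.Random(seed)
    n = len(block_of)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            p = edge_prob[block_of[i]][block_of[j]]
            # Each vertex pair is an independent Bernoulli(p) trial.
            if rng.random() < p:
                adj[i][j] = adj[j][i] = 1
    return adj


if __name__ == "__main__":
    blocks = [0] * 5 + [1] * 5           # illustrative: two blocks of five vertices
    probs = [[0.9, 0.1], [0.1, 0.9]]     # illustrative: dense within, sparse between
    for row in sample_sbm(blocks, probs, seed=1):
        print(row)
```

In the vertex nomination setting, some or all entries of `block_of` would be hidden from the analyst, who sees only the adjacency matrix and any observed labels.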
Accelerating pseudo-marginal Metropolis-Hastings by correlating auxiliary variables
Dahlin, Johan, Lindsten, Fredrik, Kronander, Joel, Schön, Thomas B.
Pseudo-marginal Metropolis-Hastings (pmMH) is a powerful method for Bayesian inference in models where the posterior distribution is analytically intractable or computationally costly to evaluate directly. It operates by introducing additional auxiliary variables into the model to form an extended target distribution, which can then be evaluated point-wise. In many cases, standard Metropolis-Hastings is then applied to sample from the extended target, and the sought posterior can be obtained by marginalisation. However, in some implementations this approach suffers from poor mixing because the auxiliary variables are sampled from an independent proposal. We propose a modification to the pmMH algorithm in which a Crank-Nicolson (CN) proposal is used instead. This introduces a positive correlation in the auxiliary variables. We investigate how to tune the CN proposal and its impact on the mixing of the resulting pmMH sampler. The conclusion is that the proposed modification can have a beneficial effect on both the mixing of the Markov chain and the computational cost of each iteration of the pmMH algorithm.
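To make the idea of a Crank-Nicolson proposal on the auxiliary variables concrete, here is a minimal sketch. It assumes the auxiliary variables are independent standard Gaussians, which the abstract does not state, and uses the common form $u' = \sqrt{1 - \sigma^2}\, u + \sigma \xi$ with $\xi \sim \mathcal{N}(0, I)$. The function name `cn_proposal` and the parameter name `step` are hypothetical.

```python
import math
import random


def cn_proposal(u, step, rng):
    """Crank-Nicolson proposal for auxiliary variables u (assumed N(0, I)).

    Returns u' = sqrt(1 - step^2) * u + step * xi with xi ~ N(0, I).
    step = 1 recovers an independent draw; a smaller step keeps u' positively
    correlated with u while leaving the N(0, I) distribution invariant.
    """
    if not 0.0 < step <= 1.0:
        raise ValueError("step must lie in (0, 1]")
    rho = math.sqrt(1.0 - step ** 2)
    return [rho * ui + step * rng.gauss(0.0, 1.0) for ui in u]


if __name__ == "__main__":
    rng = random.Random(0)
    u = [rng.gauss(0.0, 1.0) for _ in range(5)]
    print(cn_proposal(u, 0.3, rng))
```

Under this Gaussian assumption the proposal is reversible with respect to the auxiliary-variable distribution, so a sampler built on it would only need the ratio of likelihood estimates in its acceptance step; how to choose `step` is the tuning question the abstract refers to.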
Resolving the Geometric Locus Dilemma for Support Vector Learning Machines
Capacity control, the bias/variance dilemma, and learning unknown functions from data, are all concerned with identifying effective and consistent fits of unknown geometric loci to random data points. A geometric locus is a curve or surface formed by points, all of which possess some uniform property. A geometric locus of an algebraic equation is the set of points whose coordinates are solutions of the equation. Any given curve or surface must pass through each point on a specified locus. This paper argues that it is impossible to fit random data points to algebraic equations of partially configured geometric loci that reference arbitrary Cartesian coordinate systems. It also argues that the fundamental curve of a linear decision boundary is actually a principal eigenaxis. It is shown that learning principal eigenaxes of linear decision boundaries involves finding a point of statistical equilibrium for which eigenenergies of principal eigenaxis components are symmetrically balanced with each other. It is demonstrated that learning linear decision boundaries involves strong duality relationships between a statistical eigenlocus of principal eigenaxis components and its algebraic forms, in primal and dual, correlated Hilbert spaces. Locus equations are introduced and developed that describe principal eigen-coordinate systems for lines, planes, and hyperplanes. These equations are used to introduce and develop primal and dual statistical eigenlocus equations of principal eigenaxes of linear decision boundaries. Important generalizations for linear decision boundaries are shown to be encoded within a dual statistical eigenlocus of principal eigenaxis components. Principal eigenaxes of linear decision boundaries are shown to encode Bayes' likelihood ratio for common covariance data and a robust likelihood ratio for all other data.
Probabilistic Segmentation via Total Variation Regularization
We present a convex approach to probabilistic segmentation and modeling of time series data. Our approach builds upon recent advances in multivariate total variation regularization, and seeks to learn a separate set of parameters for the distribution over the observations at each time point, but with an additional penalty that encourages the parameters to remain constant over time. We propose efficient optimization methods for solving the resulting (large) optimization problems, and a two-stage procedure for estimating recurring clusters under such models, based upon kernel density estimation. Finally, we show that on a number of real-world segmentation tasks, the resulting methods often perform as well as or better than existing latent variable models, while being substantially easier to train.
1 Introduction
In this paper, we consider the tasks of time series segmentation and modeling. Formally, suppose that we observe a sequence of $T$ input/output pairs, represented as $(x_1, y_1), (x_2, y_2), \ldots, (x_T, y_T)$ (1) for $x_t \in \mathbb{R}^n$ (which can even include functions of past outputs of the time series to capture scenarios such as autoregressive models) and $y_t \in \mathbb{R}^p$ (though we can also consider other possible forms of the output vector, such as categorical variables).
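The penalized formulation described in the abstract (one parameter vector per time point, plus a penalty that encourages parameters to stay constant over time) can be illustrated with a small objective function. This is a sketch under assumptions not stated in the source: a unit-variance Gaussian observation model with mean $\theta_t$, a group (Euclidean-norm) total variation penalty, and the hypothetical names `tv_objective` and `lam`.

```python
import math


def tv_objective(y, theta, lam):
    """Evaluate sum_t 0.5 * ||y_t - theta_t||^2 + lam * sum_t ||theta_{t+1} - theta_t||_2.

    y and theta are lists of equal-length vectors, one per time point.
    The first term is the Gaussian negative log-likelihood up to a constant;
    the second is a multivariate total variation penalty on the parameters.
    """
    fit = sum(
        0.5 * sum((yi - ti) ** 2 for yi, ti in zip(y_t, th_t))
        for y_t, th_t in zip(y, theta)
    )
    penalty = sum(
        math.sqrt(sum((a - b) ** 2 for a, b in zip(theta[t + 1], theta[t])))
        for t in range(len(theta) - 1)
    )
    return fit + lam * penalty


if __name__ == "__main__":
    y = [[0.0, 0.0], [0.0, 0.0], [3.0, 4.0]]
    # Fitting every point exactly costs one jump; a constant fit costs misfit.
    print(tv_objective(y, y, lam=2.0))
    print(tv_objective(y, [[0.0, 0.0]] * 3, lam=2.0))
```

Because the penalty is a sum of norms of successive differences, minimizers tend to be piecewise constant in time, and the change points of the fitted parameters give the segmentation.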