Bayesian Inference
Approximate Counting in SMT and Value Estimation for Probabilistic Programs
Chistikov, Dmitry, Dimitrova, Rayna, Majumdar, Rupak
#SMT, or model counting for logical theories, is a well-known hard problem that generalizes such tasks as counting the number of satisfying assignments to a Boolean formula and computing the volume of a polytope. In the realm of satisfiability modulo theories (SMT) there is a growing need for model counting solvers, coming from several application domains (quantitative information flow, static analysis of probabilistic programs). In this paper, we show a reduction from an approximate version of #SMT to SMT. We focus on the theories of integer arithmetic and linear real arithmetic. We propose model counting algorithms that provide approximate solutions with formal bounds on the approximation error. They run in polynomial time and make a polynomial number of queries to the SMT solver for the underlying theory, exploiting "for free" the sophisticated heuristics implemented within modern SMT solvers. We have implemented the algorithms and used them to solve the value problem for a model of loop-free probabilistic programs with nondeterminism.
Parallelizing MCMC with Random Partition Trees
Wang, Xiangyu, Guo, Fangjian, Heller, Katherine A., Dunson, David B.
The modern scale of data has brought new challenges to Bayesian inference. In particular, conventional MCMC algorithms are computationally very expensive for large data sets. A promising approach to solve this problem is embarrassingly parallel MCMC (EP-MCMC), which first partitions the data into multiple subsets and runs independent sampling algorithms on each subset. The subset posterior draws are then aggregated via some combining rules to obtain the final approximation. Existing EP-MCMC algorithms are limited by approximation accuracy and difficulty in resampling. In this article, we propose a new EP-MCMC algorithm PART that solves these problems. The new algorithm applies random partition trees to combine the subset posterior draws, which is distribution-free, easy to resample from and can adapt to multiple scales. We provide theoretical justification and extensive experiments illustrating empirical performance.
On the Statistical Efficiency of $\ell_{1,p}$ Multi-Task Learning of Gaussian Graphical Models
Honorio, Jean, Jaakkola, Tommi, Samaras, Dimitris
We analyze the sufficient number of samples for the correct recovery of the support union and edge signs. We also analyze the necessary number of samples for any conceivable method by providing information-theoretic lower bounds. We compare the statistical efficiency of multi-task learning versus that of single-task learning. For experiments, we use a block coordinate descent method that is provably convergent and generates a sequence of positive definite solutions. We provide experimental validation on synthetic data as well as on two publicly available real-world data sets, including functional magnetic resonance imaging and gene expression data.
Filtering with State-Observation Examples via Kernel Monte Carlo Filter
Kanagawa, Motonobu, Nishiyama, Yu, Gretton, Arthur, Fukumizu, Kenji
This paper addresses the problem of filtering with a state-space model. Standard approaches for filtering assume that a probabilistic model for observations (i.e. the observation model) is given explicitly or at least parametrically. We consider a setting where this assumption is not satisfied; we assume that the knowledge of the observation model is only provided by examples of state-observation pairs. This setting is important and appears when state variables are defined as quantities that are very different from the observations. We propose Kernel Monte Carlo Filter, a novel filtering method that is focused on this setting. Our approach is based on the framework of kernel mean embeddings, which enables nonparametric posterior inference using the state-observation examples. The proposed method represents state distributions as weighted samples, propagates these samples by sampling, estimates the state posteriors by Kernel Bayes' Rule, and resamples by Kernel Herding. In particular, the sampling and resampling procedures are novel in being expressed using kernel mean embeddings, so we theoretically analyze their behaviors. We reveal the following properties, which are similar to those of corresponding procedures in particle methods: (1) the performance of sampling can degrade if the effective sample size of a weighted sample is small; (2) resampling improves the sampling performance by increasing the effective sample size. We first demonstrate these theoretical findings by synthetic experiments. Then we show the effectiveness of the proposed filter by artificial and real data experiments, which include vision-based mobile robot localization.
A Bounded $p$-norm Approximation of Max-Convolution for Sub-Quadratic Bayesian Inference on Additive Factors
Pfeuffer, Julianus, Serang, Oliver
Max-convolution is an important problem closely resembling standard convolution; as such, max-convolution occurs frequently across many fields. Here we extend the method with fastest known worst-case runtime, which can be applied to nonnegative vectors by numerically approximating the Chebyshev norm $\| \cdot \|_\infty$, and use this approach to derive two numerically stable methods based on the idea of computing $p$-norms via fast convolution: The first method proposed, with runtime in $O( k \log(k) \log(\log(k)) )$ (which is less than $18 k \log(k)$ for any vectors that can be practically realized), uses the $p$-norm as a direct approximation of the Chebyshev norm. The second approach proposed, with runtime in $O( k \log(k) )$ (although in practice both perform similarly), uses a novel null space projection method, which extracts information from a sequence of $p$-norms to estimate the maximum value in the vector (this is equivalent to querying a small number of moments from a distribution of bounded support in order to estimate the maximum). The $p$-norm approaches are compared to one another and are shown to compute an approximation of the Viterbi path in a hidden Markov model where the transition matrix is a Toeplitz matrix; the runtime of approximating the Viterbi path is thus reduced from $O( n k^2 )$ steps to $O( n $k \log(k))$ steps in practice, and is demonstrated by inferring the U.S. unemployment rate from the S&P 500 stock index.
Partition MCMC for inference on acyclic digraphs
Acyclic digraphs are the underlying representation of Bayesian networks, a widely used class of probabilistic graphical models. Learning the underlying graph from data is a way of gaining insights about the structural properties of a domain. Structure learning forms one of the inference challenges of statistical graphical models. MCMC methods, notably structure MCMC, to sample graphs from the posterior distribution given the data are probably the only viable option for Bayesian model averaging. Score modularity and restrictions on the number of parents of each node allow the graphs to be grouped into larger collections, which can be scored as a whole to improve the chain's convergence. Current examples of algorithms taking advantage of grouping are the biased order MCMC, which acts on the alternative space of permuted triangular matrices, and non ergodic edge reversal moves. Here we propose a novel algorithm, which employs the underlying combinatorial structure of DAGs to define a new grouping. As a result convergence is improved compared to structure MCMC, while still retaining the property of producing an unbiased sample. Finally the method can be combined with edge reversal moves to improve the sampler further.
Accelerometer based Activity Classification with Variational Inference on Sticky HDP-SLDS
Basbug, Mehmet Emin, Ozcan, Koray, Velipasalar, Senem
As part of daily monitoring of human activities, wearable sensors and devices are becoming increasingly popular sources of data. With the advent of smartphones equipped with acceloremeter, gyroscope and camera; it is now possible to develop activity classification platforms everyone can use conveniently. In this paper, we propose a fast inference method for an unsupervised non-parametric time series model namely variational inference for sticky HDP-SLDS(Hierarchical Dirichlet Process Switching Linear Dynamical System). We show that the proposed algorithm can differentiate various indoor activities such as sitting, walking, turning, going up/down the stairs and taking the elevator using only the acceloremeter of an Android smartphone Samsung Galaxy S4. We used the front camera of the smartphone to annotate activity types precisely. We compared the proposed method with Hidden Markov Models with Gaussian emission probabilities on a dataset of 10 subjects. We showed that the efficacy of the stickiness property. We further compared the variational inference to the Gibbs sampler on the same model and show that variational inference is faster in one order of magnitude.
A Bayesian alternative to mutual information for the hierarchical clustering of dependent random variables
Marrelec, Guillaume, Messรฉ, Arnaud, Bellec, Pierre
The use of mutual information as a similarity measure in agglomerative hierarchical clustering (AHC) raises an important issue: some correction needs to be applied for the dimensionality of variables. In this work, we formulate the decision of merging dependent multivariate normal variables in an AHC procedure as a Bayesian model comparison. We found that the Bayesian formulation naturally shrinks the empirical covariance matrix towards a matrix set a priori (e.g., the identity), provides an automated stopping rule, and corrects for dimensionality using a term that scales up the measure as a function of the dimensionality of the variables. Also, the resulting log Bayes factor is asymptotically proportional to the plug-in estimate of mutual information, with an additive correction for dimensionality in agreement with the Bayesian information criterion. We investigated the behavior of these Bayesian alternatives (in exact and asymptotic forms) to mutual information on simulated and real data. An encouraging result was first derived on simulations: the hierarchical clustering based on the log Bayes factor outperformed off-the-shelf clustering techniques as well as raw and normalized mutual information in terms of classification accuracy. On a toy example, we found that the Bayesian approaches led to results that were similar to those of mutual information clustering techniques, with the advantage of an automated thresholding. On real functional magnetic resonance imaging (fMRI) datasets measuring brain activity, it identified clusters consistent with the established outcome of standard procedures. On this application, normalized mutual information had a highly atypical behavior, in the sense that it systematically favored very large clusters. These initial experiments suggest that the proposed Bayesian alternatives to mutual information are a useful new tool for hierarchical clustering.
Topic-adjusted visibility metric for scientific articles
Tan, Linda S. L., Chan, Aik Hui, Zheng, Tian
Measuring the impact of scientific articles is important for evaluating the research output of individual scientists, academic institutions and journals. While citations are raw data for constructing impact measures, there exist biases and potential issues if factors affecting citation patterns are not properly accounted for. In this work, we address the problem of field variation and introduce an article level metric useful for evaluating individual articles' visibility. This measure derives from joint probabilistic modeling of the content in the articles and the citations amongst them using latent Dirichlet allocation (LDA) and the mixed membership stochastic blockmodel (MMSB). Our proposed model provides a visibility metric for individual articles adjusted for field variation in citation rates, a structural understanding of citation behavior in different fields, and article recommendations which take into account article visibility and citation patterns. We develop an efficient algorithm for model fitting using variational methods. To scale up to large networks, we develop an online variant using stochastic gradient methods and case-control likelihood approximation. We apply our methods to the benchmark KDD Cup 2003 dataset with approximately 30,000 high energy physics papers.
Mining Combined Causes in Large Data Sets
Ma, Saisai, Li, Jiuyong, Liu, Lin, Le, Thuc Duy
In recent years, many methods have been developed for detecting causal relationships in observational data. Some of them have the potential to tackle large data sets. However, these methods fail to discover a combined cause, i.e. a multi-factor cause consisting of two or more component variables which individually are not causes. A straightforward approach to uncovering a combined cause is to include both individual and combined variables in the causal discovery using existing methods, but this scheme is computationally infeasible due to the huge number of combined variables. In this paper, we propose a novel approach to address this practical causal discovery problem, i.e. mining combined causes in large data sets. The experiments with both synthetic and real world data sets show that the proposed method can obtain high-quality causal discoveries with a high computational efficiency.