Goto

Collaborating Authors

 Bayesian Inference


Probability Theory without Bayes' Rule

arXiv.org Artificial Intelligence

Within the Kolmogorov theory of probability, Bayes' rule allows one to perform statistical inference by relating conditional probabilities to unconditional probabilities. As we show here, however, there is a continuous set of alternative inference rules that yield the same results, and that may have computational or practical advantages for certain problems. We formulate generalized axioms for probability theory, according to which the reverse conditional probability distribution P(B|A) is not specified by the forward conditional probability distribution P(A|B) and the marginals P(A) and P(B). Thus, in order to perform statistical inference, one must specify an additional "inference axiom," which relates P(B|A) to P(A|B), P(A), and P(B). We show that when Bayes' rule is chosen as the inference axiom, the axioms are equivalent to the classical Kolmogorov axioms. We then derive consistency conditions on the inference axiom, and thereby characterize the set of all possible rules for inference. The set of "first-order" inference axioms, defined as the set of axioms in which P(B|A) depends on the first power of P(A|B), is found to be a 1-simplex, with Bayes' rule at one of the extreme points. The other extreme point, the "inversion rule," is studied in depth.


CAM: Causal additive models, high-dimensional order search and penalized regression

arXiv.org Machine Learning

We develop estimation for potentially high-dimensional additive structural equation models. A key component of our approach is to decouple order search among the variables from feature or edge selection in a directed acyclic graph encoding the causal structure. We show that the former can be done with nonregularized (restricted) maximum likelihood estimation while the latter can be efficiently addressed using sparse regression techniques. Thus, we substantially simplify the problem of structure search and estimation for an important class of causal models. We establish consistency of the (restricted) maximum likelihood estimator for low- and high-dimensional scenarios, and we also allow for misspecification of the error distribution. Furthermore, we develop an efficient computational algorithm which can deal with many variables, and the new method's accuracy and performance is illustrated on simulated and real data.


Efficiently learning Ising models on arbitrary graphs

arXiv.org Machine Learning

We consider the problem of reconstructing the graph underlying an Ising model from i.i.d. samples. Over the last fifteen years this problem has been of significant interest in the statistics, machine learning, and statistical physics communities, and much of the effort has been directed towards finding algorithms with low computational cost for various restricted classes of models. Nevertheless, for learning Ising models on general graphs with $p$ nodes of degree at most $d$, it is not known whether or not it is possible to improve upon the $p^{d}$ computation needed to exhaustively search over all possible neighborhoods for each node. In this paper we show that a simple greedy procedure allows to learn the structure of an Ising model on an arbitrary bounded-degree graph in time on the order of $p^2$. We make no assumptions on the parameters except what is necessary for identifiability of the model, and in particular the results hold at low-temperatures as well as for highly non-uniform models. The proof rests on a new structural property of Ising models: we show that for any node there exists at least one neighbor with which it has a high mutual information. This structural property may be of independent interest.


Efficient inference of overlapping communities in complex networks

arXiv.org Machine Learning

We discuss two views on extending existing methods for complex network modeling which we dub the communities first and the networks first view, respectively. Inspired by the networks first view that we attribute to White, Boorman, and Breiger (1976)[1], we formulate the multiple-networks stochastic blockmodel (MNSBM), which seeks to separate the observed network into subnetworks of different types and where the problem of inferring structure in each subnetwork becomes easier. We show how this model is specified in a generative Bayesian framework where parameters can be inferred efficiently using Gibbs sampling. The result is an effective multiple-membership model without the drawbacks of introducing complex definitions of "groups" and how they interact. We demonstrate results on the recovery of planted structure in synthetic networks and show very encouraging results on link prediction performances using multiple-networks models on a number of real-world network data sets.


A Nonparametric Bayesian Approach to Uncovering Rat Hippocampal Population Codes During Spatial Navigation

arXiv.org Machine Learning

Rodent hippocampal population codes represent important spatial information about the environment during navigation. Several computational methods have been developed to uncover the neural representation of spatial topology embedded in rodent hippocampal ensemble spike activity. Here we extend our previous work and propose a nonparametric Bayesian approach to infer rat hippocampal population codes during spatial navigation. To tackle the model selection problem, we leverage a nonparametric Bayesian model. Specifically, to analyze rat hippocampal ensemble spiking activity, we apply a hierarchical Dirichlet process-hidden Markov model (HDP-HMM) using two Bayesian inference methods, one based on Markov chain Monte Carlo (MCMC) and the other based on variational Bayes (VB). We demonstrate the effectiveness of our Bayesian approaches on recordings from a freely-behaving rat navigating in an open field environment. We find that MCMC-based inference with Hamiltonian Monte Carlo (HMC) hyperparameter sampling is flexible and efficient, and outperforms VB and MCMC approaches with hyperparameters set by empirical Bayes.


The Poisson transform for unnormalised statistical models

arXiv.org Machine Learning

Contrary to standard statistical models, unnormalised statistical models only specify the likelihood function up to a constant. While such models are natural and popular, the lack of normalisation makes inference much more difficult. Here we show that inferring the parameters of a unnormalised model on a space $\Omega$ can be mapped onto an equivalent problem of estimating the intensity of a Poisson point process on $\Omega$. The unnormalised statistical model now specifies an intensity function that does not need to be normalised. Effectively, the normalisation constant may now be inferred as just another parameter, at no loss of information. The result can be extended to cover non-IID models, which includes for example unnormalised models for sequences of graphs (dynamical graphs), or for sequences of binary vectors. As a consequence, we prove that unnormalised parameteric inference in non-IID models can be turned into a semi-parametric estimation problem. Moreover, we show that the noise-contrastive divergence of Gutmann & Hyv\"arinen (2012) can be understood as an approximation of the Poisson transform, and extended to non-IID settings. We use our results to fit spatial Markov chain models of eye movements, where the Poisson transform allows us to turn a highly non-standard model into vanilla semi-parametric logistic regression.


Noise Benefits in Expectation-Maximization Algorithms

arXiv.org Machine Learning

This dissertation shows that careful injection of noise into sample data can substantially speed up Expectation-Maximization algorithms. Expectation-Maximization algorithms are a class of iterative algorithms for extracting maximum likelihood estimates from corrupted or incomplete data. The convergence speed-up is an example of a noise benefit or "stochastic resonance" in statistical signal processing. The dissertation presents derivations of sufficient conditions for such noise-benefits and demonstrates the speed-up in some ubiquitous signal-processing algorithms. These algorithms include parameter estimation for mixture models, the $k$-means clustering algorithm, the Baum-Welch algorithm for training hidden Markov models, and backpropagation for training feedforward artificial neural networks. This dissertation also analyses the effects of data and model corruption on the more general Bayesian inference estimation framework. The main finding is a theorem guaranteeing that uniform approximators for Bayesian model functions produce uniform approximators for the posterior pdf via Bayes theorem. This result also applies to hierarchical and multidimensional Bayesian models.


bartMachine: Machine Learning with Bayesian Additive Regression Trees

arXiv.org Machine Learning

Ensemble-of-trees methods have become popular choices for forecasting in both regression and classification problems. Algorithms such as random forests (Breiman 2001) and stochastic gradient boosting (Friedman 2002) are two well-established and widely employed procedures. Recent advances in ensemble methods include dynamic trees (Taddy, Gramacy, and Polson 2011) and Bayesian additive regression trees (BART, Chipman, George, and McCulloch 2010), which depart from predecessors in that they rely on an underlying Bayesian probability model rather than a pure algorithm. BART has demonstrated substantial promise in a wide variety of simulations and real world applications such as predicting avalanches on mountain roads (Blattenberger and Fowles 2014), predicting how transcription factors interact with DNA (Zhou and Liu 2008) and predicting movie box office revenues (Eliashberg 2010). This paper introduces bartMachine, a new R (R Core Team 2014) package available from the Comprehensive R Archive Network at http://CRAN.R-project.org/package


A Greedy, Flexible Algorithm to Learn an Optimal Bayesian Network Structure

arXiv.org Machine Learning

In this report paper we first present a report of the Advanced Machine Learning Course Project on the provided data set and then present a novel heuristic algorithm for exact Bayesian network (BN) structure discovery that uses decomposable scoring functions. Our algorithm follows a different approach to solve the problem of BN structure discovery than the previously used methods such as Dynamic Programming (DP) and Branch and Bound to reduce the search space and find the global optima space for the problem. The algorithm we propose has some degree of flexibility that can make it more or less greedy. The more the algorithm is set to be greedy, the more the speed of the algorithm will be, and the less optimal the final structure. Our algorithm runs in a much less time than the previously known methods and guarantees to have an optimality of close to 99%.


SIMD Parallel MCMC Sampling with Applications for Big-Data Bayesian Analytics

arXiv.org Artificial Intelligence

Computational intensity and sequential nature of estimation techniques for Bayesian methods in statistics and machine learning, combined with their increasing applications for big data analytics, necessitate both the identification of potential opportunities to parallelize techniques such as MCMC sampling, and the development of general strategies for mapping such parallel algorithms to modern CPUs in order to elicit the performance up the compute-based and/or memory-based hardware limits. Two opportunities for Single-Instruction Multiple-Data (SIMD) parallelization of MCMC sampling for probabilistic graphical models are presented. In exchangeable models with many observations such as Bayesian Generalized Linear Models, child-node contributions to the conditional posterior of each node can be calculated concurrently. In undirected graphs with discrete nodes, concurrent sampling of conditionally-independent nodes can be transformed into a SIMD form. High-performance libraries with multi-threading and vectorization capabilities can be readily applied to such SIMD opportunities to gain decent speedup, while a series of high-level source-code and runtime modifications provide further performance boost by reducing parallelization overhead and increasing data locality for NUMA architectures. For big-data Bayesian GLM graphs, the end-result is a routine for evaluating the conditional posterior and its gradient vector that is 5 times faster than a naive implementation using (built-in) multi-threaded Intel MKL BLAS, and reaches within the striking distance of the memory-bandwidth-induced hardware limit. The proposed optimization strategies improve the scaling of performance with number of cores and width of vector units (applicable to many-core SIMD processors such as Intel Xeon Phi and GPUs), resulting in cost-effectiveness, energy efficiency, and higher speed on multi-core x86 processors.