Bayesian Inference


Bayesian optimization for materials design

arXiv.org Machine Learning

We introduce Bayesian optimization, a technique developed for optimizing time-consuming engineering simulations and for fitting machine learning models on large datasets. Bayesian optimization guides the choice of experiments during materials design and discovery to find good material designs in as few experiments as possible. We focus on the case when materials designs are parameterized by a low-dimensional vector. Bayesian optimization is built on a statistical technique called Gaussian process regression, which allows predicting the performance of a new design based on previously tested designs. After providing a detailed introduction to Gaussian process regression, we introduce two Bayesian optimization methods: expected improvement, for design problems with noise-free evaluations; and the knowledge-gradient method, which generalizes expected improvement and may be used in design problems with noisy evaluations. Both methods are derived using a value-of-information analysis, and enjoy one-step Bayes-optimality.
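As a rough illustration of the expected improvement criterion named in this abstract, the sketch below evaluates the standard closed form for a single candidate design. It assumes a maximisation problem, noise-free evaluations, and a Gaussian process posterior that is summarised at the candidate by a mean and a standard deviation. The function name and the parameters `mu`, `sigma` and `best` are hypothetical and are not taken from the paper.

```python
import math


def expected_improvement(mu, sigma, best):
    """Expected improvement of a candidate over the best value seen so far.

    Assumptions (not from the source): we are maximising, `mu` and `sigma`
    are the posterior mean and standard deviation at the candidate, and
    `best` is the largest noise-free value observed so far.
    """
    if sigma <= 0.0:
        # No posterior uncertainty: the improvement is deterministic.
        return max(mu - best, 0.0)
    z = (mu - best) / sigma
    # Standard normal density and cumulative distribution at z.
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mu - best) * cdf + sigma * pdf


# Two hypothetical candidates with the same mean but different uncertainty.
ei_certain = expected_improvement(mu=1.0, sigma=0.1, best=1.0)
ei_uncertain = expected_improvement(mu=1.0, sigma=0.5, best=1.0)
```

With equal means, the more uncertain candidate scores higher, which is how the criterion trades exploitation against exploration.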


Probabilistic Network Metrics: Variational Bayesian Network Centrality

arXiv.org Machine Learning

Network metrics form a fundamental part of the network analysis toolbox. Used to quantitatively measure different aspects of the network, these metrics can give insights into the underlying network structure and function. In this work, we connect network metrics to modern probabilistic machine learning. We focus on the centrality metric, which is used in a wide variety of applications from web search to gene analysis. First, we formulate an eigenvector-based Bayesian centrality model for determining node importance. Compared to existing methods, our probabilistic model allows for the assimilation of multiple edge weight observations, the inclusion of priors and the extraction of uncertainties. To enable tractable inference, we develop a variational lower bound (VBC) that is demonstrated to be effective on a variety of networks (two synthetic and five real-world graphs). We then bridge this model to sparse Gaussian processes. The sparse variational Bayesian centrality Gaussian process (VBC-GP) learns a mapping from node attributes to latent centrality and hence is capable of predicting centralities from node features and can potentially represent a large number of nodes using only a limited number of inducing inputs. Experiments show that the VBC-GP learns high-quality mappings and compares favorably to a two-step baseline, i.e., a full GP trained on the node attributes and pre-computed centralities. Finally, we present two case studies using the VBC-GP: first, to ascertain relevant features in a taxi transport network and second, to distribute a limited number of vaccines to mitigate the severity of a viral outbreak.
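For readers unfamiliar with the eigenvector-based centrality that this abstract builds on, here is a minimal sketch of the classical, non-Bayesian version computed by power iteration. It is not the paper's VBC model. The function name, the dense adjacency-list representation and the identity shift (added so the iteration also converges on bipartite graphs) are assumptions made for illustration.

```python
import math


def eigenvector_centrality(adj, iters=200):
    """Classical eigenvector centrality by shifted power iteration.

    `adj` is a symmetric adjacency matrix (list of lists) of a connected,
    undirected graph. Iterating with A + I instead of A keeps the same
    eigenvectors but avoids oscillation on bipartite graphs.
    """
    n = len(adj)
    x = [1.0] * n
    for _ in range(iters):
        # One multiplication by (A + I).
        y = [x[i] + sum(adj[i][j] * x[j] for j in range(n)) for i in range(n)]
        # Rescale to unit Euclidean norm so the values stay bounded.
        norm = math.sqrt(sum(v * v for v in y))
        x = [v / norm for v in y]
    return x


# Hypothetical example: a path graph 0 - 1 - 2.
path = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
scores = eigenvector_centrality(path)
```

The middle node of the path receives the largest score. A point estimate like this carries no uncertainty, which is the gap the probabilistic model in the abstract is meant to address.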


Bayesian Network Constraint-Based Structure Learning Algorithms: Parallel and Optimised Implementations in the bnlearn R Package

arXiv.org Artificial Intelligence

It is well known in the literature that the problem of learning the structure of Bayesian networks is very hard to tackle: its computational complexity is super-exponential in the number of nodes in the worst case and polynomial in most real-world scenarios. Efficient implementations of score-based structure learning benefit from past and current research in optimisation theory, which can be adapted to the task by using the network score as the objective function to maximise. This is not true for approaches based on conditional independence tests, called constraint-based learning algorithms. The only optimisation in widespread use, backtracking, leverages the symmetries implied by the definitions of neighbourhood and Markov blanket. In this paper we illustrate how backtracking is implemented in recent versions of the bnlearn R package, and how it degrades the stability of Bayesian network structure learning for little gain in terms of speed. As an alternative, we describe a software architecture and framework that can be used to parallelise constraint-based structure learning algorithms (also implemented in bnlearn) and we demonstrate its performance using four reference networks and two real-world data sets from genetics and systems biology. We show that on modern multi-core or multiprocessor hardware parallel implementations are preferable over backtracking, which was developed when single-processor machines were the norm.


On the Computational Complexity of High-Dimensional Bayesian Variable Selection

arXiv.org Machine Learning

We study the computational complexity of Markov chain Monte Carlo (MCMC) methods for high-dimensional Bayesian linear regression under sparsity constraints. We first show that a Bayesian approach can achieve variable-selection consistency under relatively mild conditions on the design matrix. We then demonstrate that the statistical criterion of posterior concentration need not imply the computational desideratum of rapid mixing of the MCMC algorithm. By introducing a truncated sparsity prior for variable selection, we provide a set of conditions that guarantee both variable-selection consistency and rapid mixing of a particular Metropolis-Hastings algorithm. The mixing time is linear in the number of covariates up to a logarithmic factor. Our proof controls the spectral gap of the Markov chain by constructing a canonical path ensemble that is inspired by the steps taken by greedy algorithms for variable selection.


A trust-region method for stochastic variational inference with applications to streaming data

arXiv.org Machine Learning

Stochastic variational inference allows for fast posterior inference in complex Bayesian models. However, the algorithm is prone to local optima, which can make the quality of the posterior approximation sensitive to the choice of hyperparameters and initialization. We address this problem by replacing the natural gradient step of stochastic variational inference with a trust-region update. We show that this leads to generally better results and reduced sensitivity to hyperparameters. We also describe a new strategy for variational inference on streaming data and show that here our trust-region method is crucial for getting good performance.


Belief Flows of Robust Online Learning

arXiv.org Machine Learning

This paper introduces a new probabilistic model for online learning which dynamically incorporates information from stochastic gradients of an arbitrary loss function. Similar to probabilistic filtering, the model maintains a Gaussian belief over the optimal weight parameters. Unlike traditional Bayesian updates, the model incorporates a small number of gradient evaluations at locations chosen using Thompson sampling, making it computationally tractable. The belief is then transformed via a linear flow field which optimally updates the belief distribution using rules derived from information theoretic principles. Several versions of the algorithm are shown using different constraints on the flow field and compared with conventional online learning algorithms. Results are given for several classification tasks including logistic regression and multilayer neural networks.


Discrete Independent Component Analysis (DICA) with Belief Propagation

arXiv.org Machine Learning

We apply belief propagation to a Bayesian bipartite graph composed of discrete independent hidden variables and discrete visible variables. The network is the discrete counterpart of Independent Component Analysis (DICA), and it is manipulated in factor graph form for inference and learning. A full set of simulations is reported for character images from the MNIST dataset. The results show that the factorial code implemented by the sources contributes to building a good generative model for the data that can be used in various inference modes.


Stochastic Annealing for Variational Inference

arXiv.org Machine Learning

Machine learning has produced a wide variety of useful tools for addressing a number of practical problems, often those that involve large-scale datasets. Indeed, a number of disciplines ranging from recommender systems to bioinformatics rely on machine intelligence to extract useful information from their datasets in an efficient manner. One of the core machine learning approaches to such tasks is to define a prior over a model on data and infer the model parameters through posterior inference (Blei, 2014). The gold standard in this direction is Markov chain Monte Carlo (MCMC), which gives a means for collecting samples from this posterior distribution in an asymptotically correct way (Robert & Casella, 2004). A frequent criticism of MCMC is that it is not scalable to large data sets, though recent work has begun to address this (e.g., Welling & Teh (2011); Maclaurin & Adams (2014)).


Weight Uncertainty in Neural Networks

arXiv.org Machine Learning

We introduce a new, efficient, principled and backpropagation-compatible algorithm for learning a probability distribution on the weights of a neural network, called Bayes by Backprop. It regularises the weights by minimising a compression cost, known as the variational free energy or the expected lower bound on the marginal likelihood. We show that this principled kind of regularisation yields comparable performance to dropout on MNIST classification. We then demonstrate how the learnt uncertainty in the weights can be used to improve generalisation in non-linear regression problems, and how this weight uncertainty can be used to drive the exploration-exploitation trade-off in reinforcement learning.
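The compression cost mentioned in this abstract can be written out as follows. This is a sketch in notation chosen here rather than quoted from the listing: $q(\mathbf{w} \mid \theta)$ denotes the variational distribution over the weights with parameters $\theta$, $P(\mathbf{w})$ the prior, and $\mathcal{D}$ the training data.

```latex
\mathcal{F}(\mathcal{D}, \theta)
  = \mathrm{KL}\!\left[\, q(\mathbf{w} \mid \theta) \,\|\, P(\mathbf{w}) \,\right]
  - \mathbb{E}_{q(\mathbf{w} \mid \theta)}\!\left[ \log P(\mathcal{D} \mid \mathbf{w}) \right]
```

Minimising $\mathcal{F}$ is equivalent to maximising a lower bound on the log marginal likelihood $\log P(\mathcal{D})$. The KL term pulls the weight distribution toward the prior, while the expectation term rewards fitting the data.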


On distinguishability criteria for estimating generative models

arXiv.org Machine Learning

Two recently introduced criteria for estimation of generative models are both based on a reduction to binary classification. Noise-contrastive estimation (NCE) is an estimation procedure in which a generative model is trained to be able to distinguish data samples from noise samples. Generative adversarial networks (GANs) are pairs of generator and discriminator networks, with the generator network learning to generate samples by attempting to fool the discriminator network into believing its samples are real data. Both estimation procedures use the same function to drive learning, which naturally raises questions about how they are related to each other, as well as whether this function is related to maximum likelihood estimation (MLE). NCE corresponds to training an internal data model belonging to the discriminator network but using a fixed generator network. We show that a variant of NCE, with a dynamic generator network, is equivalent to maximum likelihood estimation. Since pairing a learned discriminator with an appropriate dynamically selected generator recovers MLE, one might expect the reverse to hold for pairing a learned generator with a certain discriminator. However, we show that recovering MLE for a learned generator requires departing from the distinguishability game. Specifically: (i) the expected gradient of the NCE discriminator can be made to match the expected gradient of MLE, if one is allowed to use a non-stationary noise distribution for NCE; (ii) no choice of discriminator network can make the expected gradient for the GAN generator match that of MLE; and (iii) the existing theory does not guarantee that GANs will converge in the non-convex case. This suggests that the key next step in GAN research is to determine whether GANs converge, and if not, to modify their training algorithm to force convergence.