Bayesian Inference
Measuring Sample Quality with Diffusions
Gorham, Jackson, Duncan, Andrew B., Vollmer, Sebastian J., Mackey, Lester
Stein's method for measuring convergence to a continuous target distribution relies on an operator characterizing the target and Stein factor bounds on the solutions of an associated differential equation. While such operators and bounds are readily available for a diversity of univariate targets, few multivariate targets have been analyzed. We introduce a new class of characterizing operators based on Ito diffusions and develop explicit multivariate Stein factor bounds for any target with a fast-coupling Ito diffusion. As example applications, we develop computable and convergence-determining diffusion Stein discrepancies for log-concave, heavy-tailed, and multimodal targets and use these quality measures to select the hyperparameters of biased Markov chain Monte Carlo (MCMC) samplers, compare random and deterministic quadrature rules, and quantify bias-variance tradeoffs in approximate MCMC. Our results establish a near-linear relationship between diffusion Stein discrepancies and Wasserstein distances, improving upon past work even for strongly log-concave targets. The exposed relationship between Stein factors and Markov process coupling may be of independent interest.
Heron Inference for Bayesian Graphical Models
Rugeles, Daniel, Hai, Zhen, Cong, Gao, Dash, Manoranjan
Bayesian graphical models have been shown to be a powerful tool for discovering uncertainty and causal structure from real-world data in many application fields. Current inference methods strike different trade-offs between computational complexity and predictive accuracy. At one end of the spectrum, variational inference approaches are computationally efficient, while at the other end, Gibbs sampling approaches are known to be relatively accurate for prediction in practice. In this paper, we extend an existing Gibbs sampling method and propose a new deterministic Heron inference (Heron) for a family of Bayesian graphical models. Besides supporting nontrivial distributability, Heron makes it easy to assess convergence status and substantially improves running efficiency. We evaluate Heron against the standard collapsed Gibbs sampler and a state-of-the-art state augmentation method in inference for well-known graphical models. Experimental results on publicly available real-life data demonstrate that Heron significantly outperforms the baseline methods for inferring Bayesian graphical models.
Bayes' Rule Applied – Towards Data Science
The fundamental idea of Bayesian inference is to become "less wrong" with more data. The process is straightforward: we have an initial belief, known as a prior, which we update as we gain additional information. Although we don't think about it as Bayesian Inference, we use this technique all the time. For example, we might initially think there is a 50% chance we will get a promotion at the end of the quarter. If we receive positive feedback from our manager, we adjust our estimate upwards, and conversely, we might decrease the probability if we make a mess with the coffee machine.
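The update described above can be written out with Bayes' rule. The following is a minimal sketch, not taken from the article: the 50% prior comes from the text, while the likelihood values (how probable each piece of evidence is with and without a coming promotion) and the function name `bayes_update` are hypothetical and chosen only for illustration.

```python
def bayes_update(prior, p_evidence_given_h, p_evidence_given_not_h):
    """Return P(H | E) given P(H), P(E | H), and P(E | not H)."""
    numerator = p_evidence_given_h * prior
    evidence = numerator + p_evidence_given_not_h * (1.0 - prior)
    return numerator / evidence


# Prior belief from the text: a 50% chance of promotion.
prior = 0.5

# Hypothetical likelihoods: positive feedback is assumed to be twice as
# likely when a promotion is coming (0.8) as when it is not (0.4).
posterior_after_feedback = bayes_update(prior, 0.8, 0.4)

# Hypothetical likelihoods: a coffee machine mishap is assumed to be less
# likely among people about to be promoted (0.1) than among others (0.3).
posterior_after_mishap = bayes_update(posterior_after_feedback, 0.1, 0.3)

print(f"after positive feedback: {posterior_after_feedback:.3f}")
print(f"after coffee mishap:     {posterior_after_mishap:.3f}")
```

Each posterior becomes the prior for the next piece of evidence, which is the sense in which the estimate is revised as more information arrives.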
Bayesian Uncertainty Estimation for Batch Normalized Deep Networks
Teye, Mattias, Azizpour, Hossein, Smith, Kevin
Deep neural networks have led to a series of breakthroughs, dramatically improving the state-of-the-art in many domains. The techniques driving these advances, however, lack a formal method to account for model uncertainty. While the Bayesian approach to learning provides a solid theoretical framework to handle uncertainty, inference in Bayesian-inspired deep neural networks is difficult. In this paper, we provide a practical approach to Bayesian learning that relies on a regularization technique found in nearly every modern network, batch normalization. We show that training a deep network using batch normalization is equivalent to approximate inference in Bayesian models, and we demonstrate how this finding allows us to make useful estimates of the model uncertainty. With our approach, it is possible to make meaningful uncertainty estimates using conventional architectures without modifying the network or the training procedure. Our approach is thoroughly validated in a series of empirical experiments on different tasks and using various measures, outperforming baselines with strong statistical significance and displaying competitive performance with other recent Bayesian approaches.
Leveraging the Exact Likelihood of Deep Latent Variable Models
Mattei, Pierre-Alexandre, Frellsen, Jes
Deep latent variable models combine the approximation abilities of deep neural networks and the statistical foundations of generative models. The induced data distribution is an infinite mixture model whose density is extremely delicate to compute. Variational methods are consequently used for inference, following the seminal work of Rezende et al. (2014) and Kingma and Welling (2014). We study the well-posedness of the exact problem (maximum likelihood) these techniques approximately solve. In particular, we show that most unconstrained models used for continuous data have an unbounded likelihood. This ill-posedness and the problems it causes are illustrated on real data. We also show how to ensure the existence of maximum likelihood estimates, and draw useful connections with nonparametric mixture models. Furthermore, we describe an algorithm that performs missing data imputation using the exact conditional likelihood of a deep latent variable model. On several real data sets, our algorithm consistently and significantly outperforms the usual imputation scheme used within deep latent variable models.
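The unbounded-likelihood phenomenon can be illustrated on a much simpler model than the ones studied in the paper. The sketch below is an assumption-laden toy, not the authors' construction: it uses a two-component Gaussian mixture (a finite analogue of the infinite mixture mentioned above), made-up data, and hypothetical function names. One component is centred on a single data point and its standard deviation is shrunk toward zero, which drives the log-likelihood upward without limit.

```python
import math


def normal_pdf(x, mean, std):
    """Density of a univariate Gaussian with the given mean and std."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def mixture_log_likelihood(data, mean1, std1, mean2, std2, weight=0.5):
    """Log-likelihood of data under a two-component Gaussian mixture."""
    return sum(
        math.log(
            weight * normal_pdf(x, mean1, std1)
            + (1.0 - weight) * normal_pdf(x, mean2, std2)
        )
        for x in data
    )


# Made-up observations, for illustration only.
data = [-1.2, -0.4, 0.3, 0.9, 2.1]

# Component 1 sits exactly on the first observation; component 2 is a
# fixed standard normal that keeps every other point's density positive.
shrinking_stds = (1.0, 1e-2, 1e-4, 1e-8)
log_liks = [
    mixture_log_likelihood(data, mean1=data[0], std1=s, mean2=0.0, std2=1.0)
    for s in shrinking_stds
]

for s, ll in zip(shrinking_stds, log_liks):
    print(f"std1 = {s:g}: log-likelihood = {ll:.2f}")
```

Because the log-likelihood keeps growing as `std1` shrinks, no maximizer exists unless the variance is constrained, which is the kind of ill-posedness the abstract refers to.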
Recovering a Hidden Community in a Preferential Attachment Graph
Hajek, Bruce, Sankagiri, Suryanarayana
A message passing algorithm is derived for recovering a dense subgraph within a graph generated by a variation of the Barabási-Albert preferential attachment model. The estimator is assumed to know the arrival times, or order of attachment, of the vertices. The derivation of the algorithm is based on belief propagation under an independence assumption. Two precursors to the message passing algorithm are analyzed: the first is a degree thresholding (DT) algorithm and the second is an algorithm based on the arrival times of the children (C) of a given vertex, where the children of a given vertex are the vertices that attached to it. Algorithm C significantly outperforms DT, showing it is beneficial to know the arrival times of the children, beyond simply knowing the number of them. For fixed fraction of vertices in the community, fixed number of new edges per arriving vertex, and fixed affinity between vertices in the community, the probability of error for recovering the label of a vertex is found as a function of the time of attachment, for either algorithm DT or C, in the large graph limit. By averaging over the time of attachment, the limit in probability of the fraction of label errors made over all vertices is identified, for either of the algorithms DT or C.
Bayesian Methods for Hackers
Of course, as an introductory book, we can only leave it at that: an introductory book. The mathematically trained may satisfy the curiosity this text generates with other texts designed with mathematical analysis in mind. For the enthusiast with less mathematical background, or one who is interested not in the mathematics but simply in the practice of Bayesian methods, this text should be sufficient and entertaining. PyMC was chosen as the probabilistic programming language for two reasons. As of this writing, there is no central resource for examples and explanations in the PyMC universe.
Automatic feature engineering using Generative Adversarial Networks
The purpose of deep learning is to learn a representation of high dimensional and noisy data using a sequence of differentiable functions, i.e., geometric transformations, that can perhaps be used for supervised learning tasks, among others. It has had great success in discriminative models, while generative models have perhaps not fared quite as well due to the limitations of explicit maximum likelihood estimation (MLE). Adversarial learning as presented in the Generative Adversarial Network (GAN) aims to overcome these problems by using implicit MLE. For these experiments, we will use the MNIST computer vision dataset and a synthetic financial transactions dataset for an insurance task. GANs are a remarkably different method of learning compared to explicit MLE. Our purpose will be to show that the representation learnt by a GAN can be used for supervised learning tasks such as image recognition and insurance loss risk prediction.
Generating Neural Networks with Neural Networks
Hypernetworks are neural networks that transform a random input vector into weights for a specified target neural network. We formulate the hypernetwork training objective as a compromise between accuracy and diversity, where the diversity takes into account trivial symmetry transformations of the target network. We show that this formulation naturally arises as a relaxation of an optimistic probability distribution objective for the generated networks, and we explain how it is related to variational inference. We use multi-layered perceptrons to form the mapping from the low dimensional input random vector to the high dimensional weight space, and demonstrate how to reduce the number of parameters in this mapping by weight sharing. We perform experiments on a four layer convolutional target network which classifies MNIST images, and show that the generated weights are diverse and have interesting distributions.
Online Machine Learning in Big Data Streams
Benczúr, András A., Kocsis, Levente, Pálovics, Róbert
The area of online machine learning in big data streams covers algorithms that (1) are distributed and (2) work from data streams with only a limited possibility to store past data. The first requirement mostly concerns software architectures and efficient algorithms. The second one also imposes nontrivial theoretical restrictions on the modeling methods: in the data stream model, older data is no longer available to revise earlier suboptimal modeling decisions as the fresh data arrives. In this article, we provide an overview of distributed software architectures and libraries as well as machine learning models for online learning. We highlight the most important ideas for classification, regression, recommendation, and unsupervised modeling from streaming data, and we show how they are implemented in various distributed data stream processing systems. This article is reference material rather than a survey. We do not attempt to be comprehensive in describing all existing methods and solutions; rather, we give pointers to the most important resources in the field. All related sub-fields (online algorithms, online learning, and distributed data processing) are hugely dominant in current research and development, with conceptually new research results and software components emerging at the time of writing. In this article, we refer to several survey results, both for distributed data processing and for online machine learning. Compared to past surveys, our article is different because we discuss recommender systems in extended detail.
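The second requirement, learning from a stream without storing past data, can be sketched in a few lines. The example below is a single-machine illustration under stated assumptions, not a method from the article: the synthetic stream, the choice of logistic regression trained by stochastic gradient descent, the learning rate, and all function names are hypothetical. Each example is used once for a prediction, once for an update, and is then discarded.

```python
import math
import random


def synthetic_stream(n, seed=0):
    """Yield (features, label) pairs one at a time (stand-in for a real stream)."""
    rng = random.Random(seed)
    for _ in range(n):
        x = (rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0))
        y = 1 if x[0] + x[1] > 0 else 0
        yield x, y


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def online_logistic_regression(examples, lr=0.5):
    """One pass over the stream: predict first, then update, then discard."""
    w = [0.0, 0.0]
    b = 0.0
    mistakes = 0
    seen = 0
    for x, y in examples:
        p = sigmoid(w[0] * x[0] + w[1] * x[1] + b)
        # Test-then-train evaluation: score the model before it sees the label.
        mistakes += int((p >= 0.5) != (y == 1))
        seen += 1
        # Stochastic gradient step on the log loss for this single example.
        g = p - y
        w[0] -= lr * g * x[0]
        w[1] -= lr * g * x[1]
        b -= lr * g
    return w, b, mistakes / seen


weights, bias, error_rate = online_logistic_regression(synthetic_stream(5000))
print("weights:", weights, "bias:", bias)
print("test-then-train error rate:", error_rate)
```

Memory use is constant in the length of the stream, since only the model parameters and two counters are kept; the price is that an early poor update cannot be revisited with the data that caused it.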