Uncertainty
Adaptive Bayesian Sampling with Monte Carlo EM
Roychowdhury, Anirban, Parthasarathy, Srinivasan
We present a novel technique for learning the mass matrices in samplers obtained from discretized dynamics that preserve some energy function. Existing adaptive samplers use Riemannian preconditioning techniques, where the mass matrices are functions of the parameters being sampled. This leads to significant complexities in the energy reformulations and resultant dynamics, often leading to implicit systems of equations and requiring inversion of high-dimensional matrices in the leapfrog steps. Our approach provides a simpler alternative, by using existing dynamics in the sampling step of a Monte Carlo EM framework, and learning the mass matrices in the M step with a novel online technique. We also propose a way to adaptively set the number of samples gathered in the E step, using sampling error estimates from the leapfrog dynamics. Along with a novel stochastic sampler based on Nos\'{e}-Poincar\'{e} dynamics, we use this framework with standard Hamiltonian Monte Carlo (HMC) as well as newer stochastic algorithms such as SGHMC and SGNHT, and show strong performance on synthetic and real high-dimensional sampling scenarios; we achieve sampling accuracies comparable to Riemannian samplers while being significantly faster.
Linear Time Computation of Moments in Sum-Product Networks
Bayesian online algorithms for Sum-Product Networks (SPNs) need to update their posterior distribution after seeing one single additional instance. To do so, they must compute moments of the model parameters under this distribution. The best existing method for computing such moments scales quadratically in the size of the SPN, although it scales linearly for trees. This unfortunate scaling makes Bayesian online algorithms prohibitively expensive, except for small or tree-structured SPNs. We propose an optimal linear-time algorithm that works even when the SPN is a general directed acyclic graph (DAG), which significantly broadens the applicability of Bayesian online algorithms for SPNs. There are three key ingredients in the design and analysis of our algorithm: 1). For each edge in the graph, we construct a linear time reduction from the moment computation problem to a joint inference problem in SPNs. 2). Using the property that each SPN computes a multilinear polynomial, we give an efficient procedure for polynomial evaluation by differentiation without expanding the network that may contain exponentially many monomials. 3). We propose a dynamic programming method to further reduce the computation of the moments of all the edges in the graph from quadratic to linear. We demonstrate the usefulness of our linear time algorithm by applying it to develop a linear time assume density filter (ADF) for SPNs.
Hierarchical Implicit Models and Likelihood-Free Variational Inference
Tran, Dustin, Ranganath, Rajesh, Blei, David M.
Implicit probabilistic models are a flexible class of models defined by a simulation process for data. They form the basis for theories which encompass our understanding of the physical world. Despite this fundamental nature, the use of implicit models remains limited due to challenges in specifying complex latent structure in them, and in performing inferences in such models with large data sets. In this paper, we first introduce hierarchical implicit models (HIMs). HIMs combine the idea of implicit densities with hierarchical Bayesian modeling, thereby defining models via simulators of data with rich hidden structure. Next, we develop likelihood-free variational inference (LFVI), a scalable variational inference algorithm for HIMs. Key to LFVI is specifying a variational family that is also implicit. This matches the model's flexibility and allows for accurate approximation of the posterior. We demonstrate diverse applications: a large-scale physical simulator for predator-prey populations in ecology; a Bayesian generative adversarial network for discrete data; and a deep implicit model for text generation.
AIDE: An algorithm for measuring the accuracy of probabilistic inference algorithms
Cusumano-Towner, Marco F., Mansinghka, Vikash K.
Approximate probabilistic inference algorithms are central to many fields. Examples include sequential Monte Carlo inference in robotics, variational inference in machine learning, and Markov chain Monte Carlo inference in statistics. A key problem faced by practitioners is measuring the accuracy of an approximate inference algorithm on a specific data set. This paper introduces the auxiliary inference divergence estimator (AIDE), an algorithm for measuring the accuracy of approximate inference algorithms. AIDE is based on the observation that inference algorithms can be treated as probabilistic models and the random variables used within the inference algorithm can be viewed as auxiliary variables. This view leads to a new estimator for the symmetric KL divergence between the approximating distributions of two inference algorithms. The paper illustrates application of AIDE to algorithms for inference in regression, hidden Markov, and Dirichlet process mixture models. The experiments show that AIDE captures the qualitative behavior of a broad class of inference algorithms and can detect failure modes of inference algorithms that are missed by standard heuristics.
Implicit Weight Uncertainty in Neural Networks
Pawlowski, Nick, Rajchl, Martin, Glocker, Ben
We interpret HyperNetworks within the framework of variational inference within implicit distributions. Our method, Bayes by Hypernet, is able to model a richer variational distribution than previous methods. Experiments show that it achieves comparable predictive performance on the MNIST classification task while providing higher predictive uncertainties compared to MC-Dropout and regular maximum likelihood training.
Variational Continual Learning
Nguyen, Cuong V., Li, Yingzhen, Bui, Thang D., Turner, Richard E.
This paper develops variational continual learning (VCL), a simple but general framework for continual learning that fuses online variational inference (VI) and recent advances in Monte Carlo VI for neural networks. The framework can successfully train both deep discriminative models and deep generative models in complex continual learning settings where existing tasks evolve over time and entirely new tasks emerge. Experimental results show that variational continual learning outperforms state-of-the-art continual learning methods on a variety of tasks, avoiding catastrophic forgetting in a fully automatic way.
Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles
Lakshminarayanan, Balaji, Pritzel, Alexander, Blundell, Charles
Deep neural networks (NNs) are powerful black box predictors that have recently achieved impressive performance on a wide spectrum of tasks. Quantifying predictive uncertainty in NNs is a challenging and yet unsolved problem. Bayesian NNs, which learn a distribution over weights, are currently the state-of-the-art for estimating predictive uncertainty; however these require significant modifications to the training procedure and are computationally expensive compared to standard (non-Bayesian) NNs. We propose an alternative to Bayesian NNs that is simple to implement, readily parallelizable, requires very little hyperparameter tuning, and yields high quality predictive uncertainty estimates. Through a series of experiments on classification and regression benchmarks, we demonstrate that our method produces well-calibrated uncertainty estimates which are as good or better than approximate Bayesian NNs. To assess robustness to dataset shift, we evaluate the predictive uncertainty on test examples from known and unknown distributions, and show that our method is able to express higher uncertainty on out-of-distribution examples. We demonstrate the scalability of our method by evaluating predictive uncertainty estimates on ImageNet.
Partial correlation graphs and the neighborhood lattice
Amini, Arash A., Aragam, Bryon, Zhou, Qing
We define and study partial correlation graphs (PCGs) with variables in a general Hilbert space and their connections to generalized neighborhood regression, without making any distributional assumptions. Using operator-theoretic arguments, and especially the properties of projection operators on Hilbert spaces, we show that these neighborhood regressions have the algebraic structure of a lattice, which we call a neighborhood lattice. This lattice property significantly reduces the number of conditions one has to check when testing all partial correlation relations among a collection of variables. In addition, we generalize the notion of perfectness in graphical models for a general PCG to this Hilbert space setting, and establish that almost all Gram matrices are perfect. Under this perfectness assumption, we show how these neighborhood lattices may be "graphically" computed using separation properties of PCGs. We also discuss extensions of these ideas to directed models, which present unique challenges compared to their undirected counterparts. Our results have implications for multivariate statistical learning in general, including structural equation models, subspace clustering, and dimension reduction. For example, we discuss how to compute neighborhood lattices efficiently and furthermore how they can be used to reduce the sample complexity of learning directed acyclic graphs. Our work demonstrates that this abstract viewpoint via projection operators significantly simplifies existing ideas and arguments from the graphical modeling literature, and furthermore can be used to extend these ideas to more general nonparametric settings.
Generalized Probabilistic Bisection for Stochastic Root-Finding
Rodriguez, Sergio, Ludkovski, Michael
We consider numerical schemes for root finding of noisy responses through generalizing the Probabilistic Bisection Algorithm (PBA) to the more practical context where the sampling distribution is unknown and location-dependent. As in standard PBA, we rely on a knowledge state for the approximate posterior of the root location. To implement the corresponding Bayesian updating, we also carry out inference of oracle accuracy, namely learning the probability of correct response. To this end we utilize batched querying in combination with a variety of frequentist and Bayesian estimators based on majority vote, as well as the underlying functional responses, if available. For guiding sampling selection we investigate both Information Directed sampling, as well as Quantile sampling. Our numerical experiments show that these strategies perform quite differently; in particular we demonstrate the efficiency of randomized quantile sampling which is reminiscent of Thompson sampling. Our work is motivated by the root-finding sub-routine in pricing of Bermudan financial derivatives, illustrated in the last section of the paper.
Partition mixture of 1D wavelets for multi-dimensional data
Traditional statistical wavelet analysis that carries out modeling and inference based on wavelet coefficients under a given, predetermined wavelet transform can quickly lose efficiency in multivariate problems, because such wavelet transforms, which are typically symmetric with respect to the dimensions, cannot adaptively exploit the energy distribution in a problem-specific manner. We introduce a principled probabilistic framework for incorporating such adaptivity---by (i) representing multivariate functions using one-dimensional (1D) wavelet transforms applied to a permuted version of the original function, and (ii) placing a prior on the corresponding permutation, thereby forming a mixture of permuted 1D wavelet transforms. Such a representation can achieve substantially better energy concentration in the wavelet coefficients. In particular, when combined with the Haar basis, we show that exact Bayesian inference under the model can be achieved analytically through a recursive message passing algorithm with a computational complexity that scales linearly with sample size. In addition, we propose a sequential Monte Carlo (SMC) inference algorithm for other wavelet bases using the exact Haar solution as the proposal. We demonstrate that with this framework even simple 1D Haar wavelets can achieve excellent performance in both 2D and 3D image reconstruction via numerical experiments, outperforming state-of-the-art multidimensional wavelet-based methods especially in low signal-to-noise ratio settings, at a fraction of the computational cost.