Bayesian Inference
Overpruning in Variational Bayesian Neural Networks
Trippe, Brian, Turner, Richard
The motivations for using variational inference (VI) in neural networks differ significantly from those in latent variable models. This has a counter-intuitive consequence; more expressive variational approximations can provide significantly worse predictions as compared to those with less expressive families. In this work we make two contributions. First, we identify a cause of this performance gap, variational over-pruning. Second, we introduce a theoretically grounded explanation for this phenomenon. Our perspective sheds light on several related published results and provides intuition into the design of effective variational approximations of neural networks.
Bayesian stochastic blockmodeling
This chapter provides a self-contained introduction to the use of Bayesian inference to extract large-scale modular structures from network data, based on the stochastic block model (SBM), as well as its degree-corrected and overlapping generalizations. We focus on nonparametric formulations that allow their inference in a manner that prevents overfitting, and enables model selection. We discuss aspects of the choice of priors, in particular how to avoid underfitting via increased Bayesian hierarchies, and we contrast the task of sampling network partitions from the posterior distribution with finding the single point estimate that maximizes it, while describing efficient algorithms to perform either one. We also show how inferring the SBM can be used to predict missing and spurious links, and shed light on the fundamental limitations of the detectability of modular structures in networks.
Nonparametric Bayesian inference of the microcanonical stochastic block model
A principled approach to characterize the hidden structure of networks is to formulate generative models, and then infer their parameters from data. When the desired structure is composed of modules or "communities", a suitable choice for this task is the stochastic block model (SBM), where nodes are divided into groups, and the placement of edges is conditioned on the group memberships. Here, we present a nonparametric Bayesian method to infer the modular structure of empirical networks, including the number of modules and their hierarchical organization. We focus on a microcanonical variant of the SBM, where the structure is imposed via hard constraints, i.e. the generated networks are not allowed to violate the patterns imposed by the model. We show how this simple model variation allows simultaneously for two important improvements over more traditional inference approaches: 1. Deeper Bayesian hierarchies, with noninformative priors replaced by sequences of priors and hyperpriors, that not only remove limitations that seriously degrade the inference on large networks, but also reveal structures at multiple scales; 2. A very efficient inference algorithm that scales well not only for networks with a large number of nodes and edges, but also with an unlimited number of modules. We show also how this approach can be used to sample modular hierarchies from the posterior distribution, as well as to perform model selection. We discuss and analyze the differences between sampling from the posterior and simply finding the single parameter estimate that maximizes it. Furthermore, we expose a direct equivalence between our microcanonical approach and alternative derivations based on the canonical SBM.
Nonparametric weighted stochastic block models
Many network systems lack a natural low-dimensional embedding from which we can readily extract their most prominent large-scale features. Instead, we have to infer this information from data, typically by decomposing the observed network into modules [1]. A principled approach to perform this task is to formulate generative models that allow this modular decomposition to be found via statistical inference [2]. The most fundamental model used for this purpose is the stochastic block model (SBM) [3], which groups nodes according to their probabilities of connection to the rest of the network. However, a central limitation of most SBM implementations is that they are defined strictly for simple or multigraphs. This means that they do not incorporate extra information on the edges, which are typically present in a variety of systems, and are required for an accurate representation of their structure. For example, to the existence of a route between two airports is associated a distance, to the biomass flow between two species in a food web is associated a flow magnitude, etc. In this work, we develop variations of the SBM that allow for this type of information on the edges to be incorporated into the network model and guide the partition of the nodes into groups in a statistically meaningful way. We follow the same basic idea put forth by Aicher et al. [4], who adapted the SBM to weighted networks by including edge values as additional covariates.
Computation of the Maximum Likelihood estimator in low-rank Factor Analysis
Khamaru, Koulik, Mazumder, Rahul
Factor analysis, a classical multivariate statistical technique is popularly used as a fundamental tool for dimensionality reduction in statistics, econometrics and data science. Estimation is often carried out via the Maximum Likelihood (ML) principle, which seeks to maximize the likelihood under the assumption that the positive definite covariance matrix can be decomposed as the sum of a low rank positive semidefinite matrix and a diagonal matrix with nonnegative entries. This leads to a challenging rank constrained nonconvex optimization problem. We reformulate the low rank ML Factor Analysis problem as a nonlinear nonsmooth semidefinite optimization problem, study various structural properties of this reformulation and propose fast and scalable algorithms based on difference of convex (DC) optimization. Our approach has computational guarantees, gracefully scales to large problems, is applicable to situations where the sample covariance matrix is rank deficient and adapts to variants of the ML problem with additional constraints on the problem parameters. Our numerical experiments demonstrate the significant usefulness of our approach over existing state-of-the-art approaches.
Bootstrapped synthetic likelihood
Approximate Bayesian computation (ABC) and synthetic likelihood (SL) techniques have enabled the use of Bayesian inference for models that may be simulated, but for which the likelihood cannot be evaluated pointwise at values of an unknown parameter $\theta$. The main idea in ABC and SL is to, for different values of $\theta$ (usually chosen using a Monte Carlo algorithm), build estimates of the likelihood based on simulations from the model conditional on $\theta$. The quality of these estimates determines the efficiency of an ABC/SL algorithm. In standard ABC/SL, the only means to improve an estimated likelihood at $\theta$ is to simulate more times from the model conditional on $\theta$, which is infeasible in cases where the simulator is computationally expensive. In this paper we describe how to use bootstrapping as a means for improving SL estimates whilst using fewer simulations from the model, and also investigate its use in ABC. Further, we investigate the use of the bag of little bootstraps as a means for applying this approach to large datasets, yielding Monte Carlo algorithms that accurately approximate posterior distributions whilst only simulating subsamples of the full data. Examples of the approach applied to i.i.d., temporal and spatial data are given.
Machine Learning Trick of the Day (7): Density Ratio Trick
A probability on its own is often an uninteresting thing. But when we can compare probabilities, that is when their full splendour is revealed. By comparing probabilities we are able form judgements; by comparing probabilities we can exploit the elements of our world that are probable; by comparing probabilities we can see the value of objects that are rare. In their own ways, all machine learning tricks help us make better probabilistic comparisons. Comparison is the theme of this post--not discussed in this series before--and the right start to this second sprint of machine learning tricks.
The variational Laplace approach to approximate Bayesian inference
Variational approaches to approximate Bayesian inference provide very efficient means of performing parameter estimation and model selection. Among these, so-called variational-Laplace or VL schemes rely on Gaussian approximations to posterior densities on model parameters. In this note, we review the main variants of VL approaches, that follow from considering nonlinear models of continuous and/or categorical data. En passant, we also derive a few novel theoretical results that complete the portfolio of existing analyses of variational Bayesian approaches, including investigations of their asymptotic convergence. We also suggest practical ways of extending existing VL approaches to hierarchical generative models that include (e.g., precision) hyperparameters.
Sparse travel time tomography with adaptive dictionaries
Bianco, Michael, Gerstoft, Peter
We develop a 2D travel time tomography method which regularizes the inversion by modeling groups of slowness pixels from discrete slowness maps, called patches, as sparse linear combinations of atoms from a dictionary. We further propose to learn optimal slowness dictionaries using dictionary learning, in parallel with the inversion. This patch regularization, which we call the local model, is integrated into the overall slowness map, called the global model. Where the local model considers small-scale variations using a sparsity constraint, the global model considers larger-scale features which are constrained using $\ell_2$-norm regularization. This local-global modeling strategy with dictionary learning has been successful for image restoration tasks such as denoising and inpainting, where diverse image content is recovered from noisy or incomplete measurements. We use this strategy in our locally-sparse travel time tomography (LST) approach to model simultaneously smooth and discontinuous slowness features. This is in contrast to conventional tomography methods, which constrain models to be exclusively smooth or discontinuous. We develop a $\textit{maximum a posteriori}$ formulation for LST and exploit the sparsity of slowness patches using dictionary learning. We demonstrate the LST approach on densely, but irregularly sampled synthetic slowness maps.
Asynchronous Stochastic Variational Inference
Mohamad, Saad, Bouchachia, Abdelhamid, Sayed-Mouchaweh, Moamar
Stochastic variational inference (SVI) employs stochastic optimization to scale up Bayesian computation to massive data. Since SVI is at its core a stochastic gradient-based algorithm, horizontal parallelism can be harnessed to allow larger scale inference. We propose a lock-free parallel implementation for SVI which allows distributed computations over multiple slaves in an asynchronous style. We show that our implementation leads to linear speed-up while guaranteeing an asymptotic ergodic convergence rate $O(1/\sqrt(T)$ ) given that the number of slaves is bounded by $\sqrt(T)$ ($T$ is the total number of iterations). The implementation is done in a high-performance computing (HPC) environment using message passing interface (MPI) for python (MPI4py). The extensive empirical evaluation shows that our parallel SVI is lossless, performing comparably well to its counterpart serial SVI with linear speed-up.