Goto

Collaborating Authors

 Bayesian Inference


Learning Generative Vision Transformer with Energy-Based Latent Space for Saliency Prediction

Neural Information Processing Systems

Vision transformer networks have shown superiority in many computer vision tasks. In this paper, we take a step further by proposing a novel generative vision transformer with latent variables following an informative energy-based prior for salient object detection. Both the vision transformer network and the energy-based prior model are jointly trained via Markov chain Monte Carlo-based maximum likelihood estimation, in which the sampling from the intractable posterior and prior distributions of the latent variables are performed by Langevin dynamics. Further, with the generative vision transformer, we can easily obtain a pixel-wise uncertainty map from an image, which indicates the model confidence in predicting saliency from the image. Different from the existing generative models which define the prior distribution of the latent variables as a simple isotropic Gaussian distribution, our model uses an energy-based informative prior which can be more expressive to capture the latent space of the data.


Can I Trust My Fairness Metric? Assessing Fairness with Unlabeled Data and Bayesian Inference

Neural Information Processing Systems

Group fairness is measured via parity of quantitative metrics across different protected demographic groups. In this paper, we investigate the problem of reliably assessing group fairness metrics when labeled examples are few but unlabeled examples are plentiful. We propose a general Bayesian framework that can augment labeled data with unlabeled data to produce more accurate and lower-variance estimates compared to methods based on labeled data alone. Our approach estimates calibrated scores (for unlabeled examples) of each group using a hierarchical latent variable model conditioned on labeled examples. This in turn allows for inference of posterior distributions for an array of group fairness metrics with a notion of uncertainty.


Automatic Variational Inference in Stan

Neural Information Processing Systems

Variational inference is a scalable technique for approximate Bayesian inference. Deriving variational inference algorithms requires tedious model-specific calculations; this makes it difficult for non-experts to use. We propose an automatic variational inference algorithm, automatic differentiation variational inference (ADVI); we implement it in Stan (code available), a probabilistic programming system. In ADVI the user provides a Bayesian model and a dataset, nothing else. We make no conjugacy assumptions and support a broad class of models.


X-CAL: Explicit Calibration for Survival Analysis

Neural Information Processing Systems

When a model's predicted number of events within any time interval is similar to the observed number, it is called well-calibrated. A survival model's calibration can be measured using, for instance, distributional calibration (D-CALIBRATION) [Haider et al., 2020] which computes the squared difference between the observed and predicted number of events within different time intervals. Classically, calibration is addressed in post-training analysis. We develop explicit calibration (X-CAL), which turns D-CALIBRATION into a differentiable objective that can be used in survival modeling alongside maximum likelihood estimation and other objectives. X-CAL allows us to directly optimize calibration and strike a desired trade-off between predictive power and calibration.


Fast Bayesian Inference with Batch Bayesian Quadrature via Kernel Recombination

Neural Information Processing Systems

Calculation of Bayesian posteriors and model evidences typically requires numerical integration. Bayesian quadrature (BQ), a surrogate-model-based approach to numerical integration, is capable of superb sample efficiency, but its lack of parallelisation has hindered its practical applications. In this work, we propose a parallelised (batch) BQ method, employing techniques from kernel quadrature, that possesses an empirically exponential convergence rate.Additionally, just as with Nested Sampling, our method permits simultaneous inference of both posteriors and model evidence.Samples from our BQ surrogate model are re-selected to give a sparse set of samples, via a kernel recombination algorithm, requiring negligible additional time to increase the batch size.Empirically, we find that our approach significantly outperforms the sampling efficiency of both state-of-the-art BQ techniques and Nested Sampling in various real-world datasets, including lithium-ion battery analytics.


Replica-Exchange Nos\'e-Hoover Dynamics for Bayesian Learning on Large Datasets

Neural Information Processing Systems

In this paper, we present a new practical method for Bayesian learning that can rapidly draw representative samples from complex posterior distributions with multiple isolated modes in the presence of mini-batch noise. This is achieved by simulating a collection of replicas in parallel with different temperatures and periodically swapping them. When evolving the replicas' states, the Nos\'e-Hoover dynamics is applied, which adaptively neutralizes the mini-batch noise. To perform proper exchanges, a new protocol is developed with a noise-aware test of acceptance, by which the detailed balance is reserved in an asymptotic way. While its efficacy on complex multimodal posteriors has been illustrated by testing over synthetic distributions, experiments with deep Bayesian neural networks on large-scale datasets have shown its significant improvements over strong baselines.


Maximum Likelihood Learning With Arbitrary Treewidth via Fast-Mixing Parameter Sets

Neural Information Processing Systems

Inference is typically intractable in high-treewidth undirected graphical models, making maximum likelihood learning a challenge. One way to overcome this is to restrict parameters to a tractable set, most typically the set of tree-structured parameters. This paper explores an alternative notion of a tractable set, namely a set of "fast-mixing parameters" where Markov chain Monte Carlo (MCMC) inference can be guaranteed to quickly converge to the stationary distribution. While it is common in practice to approximate the likelihood gradient using samples obtained from MCMC, such procedures lack theoretical guarantees. This paper proves that for any exponential family with bounded sufficient statistics, (not just graphical models) when parameters are constrained to a fast-mixing set, gradient descent with gradients approximated by sampling will approximate the maximum likelihood solution inside the set with high-probability.


Implicit MLE: Backpropagating Through Discrete Exponential Family Distributions

Neural Information Processing Systems

Combining discrete probability distributions and combinatorial optimization problems with neural network components has numerous applications but poses several challenges. We propose Implicit Maximum Likelihood Estimation (I-MLE), a framework for end-to-end learning of models combining discrete exponential family distributions and differentiable neural components. I-MLE is widely applicable as it only requires the ability to compute the most probable states and does not rely on smooth relaxations. The framework encompasses several approaches such as perturbation-based implicit differentiation and recent methods to differentiate through black-box combinatorial solvers. We introduce a novel class of noise distributions for approximating marginals via perturb-and-MAP.


Independence Testing for Bounded Degree Bayesian Networks

Neural Information Processing Systems

We study the following independence testing problem: given access to samples from a distribution P over \{0,1\} n, decide whether P is a product distribution or whether it is \varepsilon -far in total variation distance from any product distribution. For arbitrary distributions, this problem requires \exp(n) samples. We show in this work that if P has a sparse structure, then in fact only linearly many samples are required.Specifically, if P is Markov with respect to a Bayesian network whose underlying DAG has in-degree bounded by d, then \tilde{\Theta}(2 {d/2}\cdot n/\varepsilon 2) samples are necessary and sufficient for independence testing.


The Broad Optimality of Profile Maximum Likelihood

Neural Information Processing Systems

We study three fundamental statistical-learning problems: distribution estimation, property estimation, and property testing. We establish the profile maximum likelihood (PML) estimator as the first unified sample-optimal approach to a wide range of learning tasks. In particular, for every alphabet size k and desired accuracy \varepsilon: \textbf{Distribution estimation} Under \ell_1 distance, PML yields optimal \Theta(k/(\varepsilon 2\log k)) sample complexity for sorted-distribution estimation, and a PML-based estimator empirically outperforms the Good-Turing estimator on the actual distribution; \textbf{Additive property estimation} For a broad class of additive properties, the PML plug-in estimator uses just four times the sample size required by the best estimator to achieve roughly twice its error, with exponentially higher confidence; \textbf{ \alpha -R\'enyi entropy estimation} For an integer \alpha 1, the PML plug-in estimator has optimal k {1-1/\alpha} sample complexity; for non-integer \alpha 3/4, the PML plug-in estimator has sample complexity lower than the state of the art; \textbf{Identity testing} In testing whether an unknown distribution is equal to or at least \varepsilon far from a given distribution in \ell_1 distance, a PML-based tester achieves the optimal sample complexity up to logarithmic factors of k . With minor modifications, most of these results also hold for a near-linear-time computable variant of PML.