Bayesian Inference
Efficient Bayes Inference in Neural Networks through Adaptive Importance Sampling
Huang, Yunshi, Chouzenoux, Emilie, Elvira, Victor, Pesquet, Jean-Christophe
Bayesian neural networks (BNNs) have received an increased interest in the last years. In BNNs, a complete posterior distribution of the unknown weight and bias parameters of the network is produced during the training stage. This probabilistic estimation offers several advantages with respect to point-wise estimates, in particular, the ability to provide uncertainty quantification when predicting new data. This feature inherent to the Bayesian paradigm, is useful in countless machine learning applications. It is particularly appealing in areas where decision-making has a crucial impact, such as medical healthcare or autonomous driving. The main challenge of BNNs is the computational cost of the training procedure since Bayesian techniques often face a severe curse of dimensionality. Adaptive importance sampling (AIS) is one of the most prominent Monte Carlo methodologies benefiting from sounded convergence guarantees and ease for adaptation. This work aims to show that AIS constitutes a successful approach for designing BNNs. More precisely, we propose a novel algorithm PMCnet that includes an efficient adaptation mechanism, exploiting geometric information on the complex (often multimodal) posterior distribution. Numerical results illustrate the excellent performance and the improved exploration capabilities of the proposed method for both shallow and deep neural networks.
Bayes classifier cannot be learned from noisy responses with unknown noise rates
Training a classifier with noisy labels typically requires the learner to specify the distribution of label noise, which is often unknown in practice. Although there have been some recent attempts to relax that requirement, we show that the Bayes decision rule is unidentified in most classification problems with noisy labels. This suggests it is generally not possible to bypass/relax the requirement. In the special cases in which the Bayes decision rule is identified, we develop a simple algorithm to learn the Bayes decision rule, that does not require knowledge of the noise distribution. In this paper, we consider classification with noisy labels.
Do deep neural networks have an inbuilt Occam's razor?
Mingard, Chris, Rees, Henry, Valle-Pérez, Guillermo, Louis, Ard A.
The remarkable performance of overparameterized deep neural networks (DNNs) must arise from an interplay between network architecture, training algorithms, and structure in the data. To disentangle these three components, we apply a Bayesian picture, based on the functions expressed by a DNN, to supervised learning. The prior over functions is determined by the network, and is varied by exploiting a transition between ordered and chaotic regimes. For Boolean function classification, we approximate the likelihood using the error spectrum of functions on data. When combined with the prior, this accurately predicts the posterior, measured for DNNs trained with stochastic gradient descent. This analysis reveals that structured data, combined with an intrinsic Occam's razor-like inductive bias towards (Kolmogorov) simple functions that is strong enough to counteract the exponential growth of the number of functions with complexity, is a key to the success of DNNs.
Toward Reliable Human Pose Forecasting with Uncertainty
Saadatnejad, Saeed, Mirmohammadi, Mehrshad, Daghyani, Matin, Saremi, Parham, Benisi, Yashar Zoroofchi, Alimohammadi, Amirhossein, Tehraninasab, Zahra, Mordan, Taylor, Alahi, Alexandre
Recently, there has been an arms race of pose forecasting methods aimed at solving the spatio-temporal task of predicting a sequence of future 3D poses of a person given a sequence of past observed ones. However, the lack of unified benchmarks and limited uncertainty analysis have hindered progress in the field. To address this, we first develop an open-source library for human pose forecasting, featuring multiple models, datasets, and standardized evaluation metrics, with the aim of promoting research and moving toward a unified and fair evaluation. Second, we devise two types of uncertainty in the problem to increase performance and convey better trust: 1) we propose a method for modeling Figure 1: We propose to model two kinds of uncertainty: aleatoric uncertainty by using uncertainty priors to inject 1) aleatoric uncertainty, learned by our model to capture knowledge about the behavior of uncertainty. This focuses the temporal evolution of uncertainty, which becomes more the capacity of the model in the direction of more meaningful prominent over time, as depicted by the lighter colors and supervision while reducing the number of learned parameters thicker bones for the right person; 2) epistemic uncertainty and improving stability; 2) we introduce a novel to detect out-of-distribution forecast poses coming from unseen approach for quantifying the epistemic uncertainty of any scenarios in training, such as for the left person.
Ranking from Pairwise Comparisons in General Graphs and Graphs with Locality
This technical report studies the problem of ranking from pairwise comparisons in the classical Bradley-Terry-Luce (BTL) model, with a focus on score estimation. For general graphs, we show that, with sufficiently many samples, maximum likelihood estimation (MLE) achieves an entrywise estimation error matching the Cram\'er-Rao lower bound, which can be stated in terms of effective resistances; the key to our analysis is a connection between statistical estimation and iterative optimization by preconditioned gradient descent. We are also particularly interested in graphs with locality, where only nearby items can be connected by edges; our analysis identifies conditions under which locality does not hurt, i.e. comparing the scores between a pair of items that are far apart in the graph is nearly as easy as comparing a pair of nearby items. We further explore divide-and-conquer algorithms that can provably achieve similar guarantees even in the regime with the sparsest samples, while enjoying certain computational advantages. Numerical results validate our theory and confirm the efficacy of the proposed algorithms.
Bayesian Inference for Jump-Diffusion Approximations of Biochemical Reaction Networks
Altıntan, Derya, Alt, Bastian, Koeppl, Heinz
Biochemical reaction networks are an amalgamation of reactions where each reaction represents the interaction of different species. Generally, these networks exhibit a multi-scale behavior caused by the high variability in reaction rates and abundances of species. The so-called jump-diffusion approximation is a valuable tool in the modeling of such systems. The approximation is constructed by partitioning the reaction network into a fast and slow subgroup of fast and slow reactions, respectively. This enables the modeling of the dynamics using a Langevin equation for the fast group, while a Markov jump process model is kept for the dynamics of the slow group. Most often biochemical processes are poorly characterized in terms of parameters and population states. As a result of this, methods for estimating hidden quantities are of significant interest. In this paper, we develop a tractable Bayesian inference algorithm based on Markov chain Monte Carlo. The presented blocked Gibbs particle smoothing algorithm utilizes a sequential Monte Carlo method to estimate the latent states and performs distinct Gibbs steps for the parameters of a biochemical reaction network, by exploiting a jump-diffusion approximation model. The presented blocked Gibbs sampler is based on the two distinct steps of state inference and parameter inference. We estimate states via a continuous-time forward-filtering backward-smoothing procedure in the state inference step. By utilizing bootstrap particle filtering within a backward-smoothing procedure, we sample a smoothing trajectory. For estimating the hidden parameters, we utilize a separate Markov chain Monte Carlo sampler within the Gibbs sampler that uses the path-wise continuous-time representation of the reaction counters. Finally, the algorithm is numerically evaluated for a partially observed multi-scale birth-death process example.
Fair Grading Algorithms for Randomized Exams
Chen, Jiale, Hartline, Jason, Zoeter, Onno
In a randomized exam, each student is asked a small number of random questions from a large question bank. The predominant grading rule is simple averaging, i.e., calculating grades by averaging scores on the questions each student is asked, which is fair ex-ante, over the randomized questions, but not fair ex-post, on the realized questions. The fair grading problem is to estimate the average grade of each student on the full question bank. The maximum-likelihood estimator for the Bradley-Terry-Luce model on the bipartite student-question graph is shown to be consistent with high probability when the number of questions asked to each student is at least the cubed-logarithm of the number of students. In an empirical study on exam data and in simulations, our algorithm based on the maximum-likelihood estimator significantly outperforms simple averaging in prediction accuracy and ex-post fairness even with a small class and exam size.
EGC: Image Generation and Classification via a Diffusion Energy-Based Model
Guo, Qiushan, Ma, Chuofan, Jiang, Yi, Yuan, Zehuan, Yu, Yizhou, Luo, Ping
Learning image classification and image generation using the same set of network parameters is a challenging problem. Recent advanced approaches perform well in one task often exhibit poor performance in the other. This work introduces an energy-based classifier and generator, namely EGC, which can achieve superior performance in both tasks using a single neural network. Unlike a conventional classifier that outputs a label given an image (i.e., a conditional distribution $p(y|\mathbf{x})$), the forward pass in EGC is a classifier that outputs a joint distribution $p(\mathbf{x},y)$, enabling an image generator in its backward pass by marginalizing out the label $y$. This is done by estimating the energy and classification probability given a noisy image in the forward pass, while denoising it using the score function estimated in the backward pass. EGC achieves competitive generation results compared with state-of-the-art approaches on ImageNet-1k, CelebA-HQ and LSUN Church, while achieving superior classification accuracy and robustness against adversarial attacks on CIFAR-10. This work represents the first successful attempt to simultaneously excel in both tasks using a single set of network parameters. We believe that EGC bridges the gap between discriminative and generative learning.
Near-Optimal Degree Testing for Bayes Nets
Arora, Vipul, Bhattacharyya, Arnab, Canonne, Clément L., Yang, Joy Qiping
This paper considers the problem of testing the maximum in-degree of the Bayes net underlying an unknown probability distribution $P$ over $\{0,1\}^n$, given sample access to $P$. We show that the sample complexity of the problem is $\tilde{\Theta}(2^{n/2}/\varepsilon^2)$. Our algorithm relies on a testing-by-learning framework, previously used to obtain sample-optimal testers; in order to apply this framework, we develop new algorithms for ``near-proper'' learning of Bayes nets, and high-probability learning under $\chi^2$ divergence, which are of independent interest.
Bayesian Causal Inference in Doubly Gaussian DAG-probit Models
Tahmasbi, Rasool, Tahmasbi, Keyvan
We consider modeling a binary response variable together with a set of covariates for two groups under observational data. The grouping variable can be the confounding variable (the common cause of treatment and outcome), gender, case/control, ethnicity, etc. Given the covariates and a binary latent variable, the goal is to construct two directed acyclic graphs (DAGs), while sharing some common parameters. The set of nodes, which represent the variables, are the same for both groups but the directed edges between nodes, which represent the causal relationships between the variables, can be potentially different. For each group, we also estimate the effect size for each node. We assume that each group follows a Gaussian distribution under its DAG. Given the parent nodes, the joint distribution of DAG is conditionally independent due to the Markov property of DAGs. We introduce the concept of Gaussian DAG-probit model under two groups and hence doubly Gaussian DAG-probit model. To estimate the skeleton of the DAGs and the model parameters, we took samples from the posterior distribution of doubly Gaussian DAG-probit model via MCMC method. We validated the proposed method using a comprehensive simulation experiment and applied it on two real datasets. Furthermore, we validated the results of the real data analysis using well-known experimental studies to show the value of the proposed grouping variable in the causality domain.