Bayesian Inference
A Dynamical Systems Approach for Convergence of the Bayesian EM Algorithm
Romero, Orlando, Das, Subhro, Chen, Pin-Yu, Pequito, Sérgio
Among the recent advances in systems and control (S&C)-based analysis of optimization algorithms, little work has been specifically dedicated to machine learning (ML) algorithms and their applications. This paper addresses this gap by illustrating how (discrete-time) Lyapunov stability theory can serve as a powerful tool to aid, or even lead, in the analysis (and potential design) of optimization algorithms that are not necessarily gradient-based. The particular ML problem that this paper focuses on is that of parameter estimation in an incomplete-data Bayesian framework via the popular optimization algorithm known as maximum a posteriori expectation-maximization (MAP-EM). Following first principles from dynamical systems stability theory, conditions for convergence of MAP-EM are developed. Furthermore, if additional assumptions are met, we show that fast convergence (linear or quadratic) is achieved, which could have been difficult to unveil without our adopted S&C approach. The convergence guarantees in this paper effectively expand the set of sufficient conditions for EM applications, thereby demonstrating the potential of similar S&C-based convergence analysis of other ML algorithms.
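To make the dynamical-systems view concrete, the sketch below treats one MAP-EM update as a map w_{t+1} = F(w_t) and tracks the log posterior along the iterates; its negative plays the role of a Lyapunov-style function that should never increase. This is only a toy illustration and not the paper's analysis: the two-component Gaussian mixture with known means and unit variances, the Beta(2, 2) prior on the mixing weight, and all names (`map_em_step`, `log_posterior`, `trace`) are assumptions introduced here.

```python
import math
import random

MEANS = (-2.0, 2.0)  # known component means (illustrative assumption)
A, B = 2.0, 2.0      # Beta(A, B) prior on the mixing weight (illustrative assumption)


def normal_pdf(x, mean):
    """Density of a unit-variance Gaussian centred at `mean`."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)


def log_posterior(w, data):
    """Unnormalized log posterior of the mixing weight w."""
    log_lik = sum(
        math.log(w * normal_pdf(x, MEANS[0]) + (1.0 - w) * normal_pdf(x, MEANS[1]))
        for x in data
    )
    log_prior = (A - 1.0) * math.log(w) + (B - 1.0) * math.log(1.0 - w)
    return log_lik + log_prior


def map_em_step(w, data):
    """One MAP-EM update, viewed as the map w_{t+1} = F(w_t)."""
    # E-step: posterior probability that each point came from component 0.
    resp = []
    for x in data:
        p0 = w * normal_pdf(x, MEANS[0])
        p1 = (1.0 - w) * normal_pdf(x, MEANS[1])
        resp.append(p0 / (p0 + p1))
    # M-step: maximizer of the expected complete-data log posterior,
    # i.e. the mode of a Beta(A + sum(resp), B + n - sum(resp)) density.
    return (sum(resp) + A - 1.0) / (len(data) + A + B - 2.0)


rng = random.Random(0)
TRUE_W = 0.3
data = [
    rng.gauss(MEANS[0] if rng.random() < TRUE_W else MEANS[1], 1.0)
    for _ in range(200)
]

w = 0.9  # deliberately poor starting point
trace = [log_posterior(w, data)]
for _ in range(30):
    w = map_em_step(w, data)
    trace.append(log_posterior(w, data))
w_hat = w
```

Inspecting `trace` shows the log posterior rising monotonically toward a fixed point of the update map, which is the behaviour a Lyapunov argument formalizes.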
Approximate Cross-Validation for Structured Models
Ghosh, Soumya, Stephenson, William T., Nguyen, Tin D., Deshpande, Sameer K., Broderick, Tamara
Many modern data analyses benefit from explicitly modeling dependence structure in data, such as measurements across time or space, ordered words in a sentence, or genes in a genome. A gold standard evaluation technique is structured cross-validation (CV), which leaves out some data subset (such as data within a time interval or data in a geographic region) in each fold. But CV here can be prohibitively slow due to the need to rerun already-expensive learning algorithms many times. Previous work has shown approximate cross-validation (ACV) methods provide a fast and provably accurate alternative in the setting of empirical risk minimization. But this existing ACV work is restricted to simpler models by the assumptions that (i) data across CV folds are independent and (ii) an exact initial model fit is available. In structured data analyses, both these assumptions are often untrue. In the present work, we address (i) by extending ACV to CV schemes with dependence structure between the folds. To address (ii), we verify, both theoretically and empirically, that ACV quality deteriorates smoothly with noise in the initial fit. We demonstrate the accuracy and computational benefits of our proposed methods on a diverse set of real-world applications.
Support Union Recovery in Meta Learning of Gaussian Graphical Models
Zhang, Qian, Zheng, Yilin, Honorio, Jean
In this paper we study meta learning of Gaussian graphical models. In our setup, each task has a different true precision matrix, each with a possibly different support (i.e., set of edges in the graph). We assume that the union of the supports of all the true precision matrices (i.e., the true support union) is small in size, which relates to sparse graphs. We propose to pool all the samples from different tasks, and estimate a single precision matrix by $\ell_1$-regularized maximum likelihood estimation. We show that with high probability, the support of the estimated single precision matrix is equal to the true support union, provided a sufficient number of samples per task $n \in O((\log N)/K)$, for $N$ nodes and $K$ tasks. That is, one requires fewer samples per task when more tasks are available. We prove a matching information-theoretic lower bound for the necessary number of samples, which is $n \in \Omega((\log N)/K)$, and thus, our algorithm is minimax optimal. Synthetic experiments validate our theory.
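For orientation, the pooled estimator described above can be written in the standard graphical-lasso form. The display below is a sketch under assumptions not stated in the abstract: zero-mean samples $x_i^{(k)}$, the same number $n$ of samples in each of the $K$ tasks, and a penalty weight $\lambda > 0$. Whether the $\ell_1$ penalty covers the diagonal entries is a detail of the paper that is not reproduced here.

```latex
\hat{\Sigma} = \frac{1}{nK} \sum_{k=1}^{K} \sum_{i=1}^{n} x_i^{(k)} \, {x_i^{(k)}}^{\top},
\qquad
\hat{\Omega} = \arg\min_{\Omega \succ 0}
  \left\{ \operatorname{tr}\!\left(\hat{\Sigma}\,\Omega\right) - \log\det\Omega + \lambda \,\|\Omega\|_1 \right\}
```

The recovered support is then the set of index pairs $(i, j)$ with $\hat{\Omega}_{ij} \neq 0$, which the abstract claims equals the true support union with high probability.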
C-SURE: Shrinkage Estimator and Prototype Classifier for Complex-Valued Deep Learning
Xing, Yifei, Chakraborty, Rudrasis, Duan, Minxuan, Yu, Stella
The James-Stein (JS) shrinkage estimator is a biased estimator that captures the mean of Gaussian random vectors. While it has a desirable statistical property of dominance over the maximum likelihood estimator (MLE) in terms of mean squared error (MSE), not much progress has been made on extending the estimator onto manifold-valued data. We propose C-SURE, a novel Stein's unbiased risk estimate (SURE) of the JS estimator on the manifold of complex-valued data with a theoretically proven optimum over MLE. Adapting the architecture of the complex-valued SurReal classifier, we further incorporate C-SURE into a prototype convolutional neural network (CNN) classifier. We compare C-SURE with SurReal and a real-valued baseline on complex-valued MSTAR and RadioML datasets. C-SURE is more accurate and robust than SurReal, and the shrinkage estimator is always better than MLE for the same prototype classifier. Like SurReal, C-SURE is much smaller, outperforming the real-valued baseline on MSTAR (RadioML) with less than 1 percent (3 percent) of the baseline size.
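The dominance of shrinkage over the MLE can be checked numerically in the classical setting. The sketch below implements the textbook real-valued James-Stein estimator that shrinks toward the origin; it is not C-SURE and does not handle complex or manifold-valued data. The dimension, true mean, noise variance, and trial count are illustrative assumptions.

```python
import random


def james_stein(x, sigma2=1.0):
    """Classical James-Stein estimate of the mean from one observation x ~ N(theta, sigma2 * I).

    Shrinks x toward the origin; requires len(x) >= 3 for the dominance result.
    """
    p = len(x)
    norm2 = sum(v * v for v in x)
    shrink = 1.0 - (p - 2) * sigma2 / norm2
    return [shrink * v for v in x]


def squared_error(est, truth):
    return sum((a - b) ** 2 for a, b in zip(est, truth))


rng = random.Random(1)
theta = [0.5] * 10   # true mean (illustrative assumption)
trials = 2000

mse_mle = 0.0
mse_js = 0.0
for _ in range(trials):
    x = [rng.gauss(t, 1.0) for t in theta]      # the MLE of theta is x itself
    mse_mle += squared_error(x, theta)
    mse_js += squared_error(james_stein(x), theta)
mse_mle /= trials
mse_js /= trials
```

Comparing `mse_js` with `mse_mle` after the loop shows the shrinkage estimator achieving the lower average squared error, at the price of a small bias toward the origin.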
Stacking for Non-mixing Bayesian Computations: The Curse and Blessing of Multimodal Posteriors
Yao, Yuling, Vehtari, Aki, Gelman, Andrew
When working with multimodal Bayesian posterior distributions, Markov chain Monte Carlo (MCMC) algorithms can have difficulty moving between modes, and default variational or mode-based approximate inferences will understate posterior uncertainty. And, even if the most important modes can be found, it is difficult to evaluate their relative weights in the posterior. Here we propose an alternative approach, using parallel runs of MCMC, variational, or mode-based inference to hit as many modes or separated regions as possible, and then combining these using importance-sampling-based Bayesian stacking, a scalable method for constructing a weighted average of distributions so as to maximize cross-validated prediction utility. The result from stacking is not necessarily equivalent, even asymptotically, to fully Bayesian inference, but it serves many of the same goals. Under misspecified models, stacking can give better predictive performance than full Bayesian inference, hence the multimodality can be considered a blessing rather than a curse. We explore with an example where the stacked inference approximates the true data generating process from the misspecified model, an example of inconsistent inference, and non-mixing samplers. We elaborate the practical implementation in the context of latent Dirichlet allocation, Gaussian process regression, hierarchical models, variational inference in horseshoe regression, and neural networks.
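The weighting step can be sketched as follows: given held-out predictive densities for each data point under each chain or approximation, find simplex weights that maximize the summed log score of the weighted mixture. The sketch assumes those densities are already available (the abstract obtains them through importance sampling, which is not reproduced here), and it uses a simple EM-style fixed-point update for mixture proportions rather than whatever optimizer the authors use. The matrix `dens` and the function names are hypothetical.

```python
import math


def log_score(weights, dens):
    """Sum over held-out points of the log density of the weighted mixture."""
    return sum(
        math.log(sum(w * p for w, p in zip(weights, row)))
        for row in dens
    )


def stacking_weights(dens, n_iter=200):
    """Maximize log_score over the simplex.

    dens[i][k] is the held-out predictive density of point i under chain k.
    Each pass is the EM update for mixture proportions with fixed components,
    which never decreases the (concave) log score.
    """
    n, k = len(dens), len(dens[0])
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        new_weights = [0.0] * k
        for row in dens:
            mix = sum(w * p for w, p in zip(weights, row))
            for j in range(k):
                new_weights[j] += weights[j] * row[j] / mix
        weights = [v / n for v in new_weights]
    return weights


# Toy input: chains 0 and 1 each explain half of the data, chain 2 explains none well.
dens = [
    [0.9, 0.1, 0.05],
    [0.8, 0.2, 0.05],
    [0.7, 0.1, 0.05],
    [0.1, 0.9, 0.05],
    [0.2, 0.8, 0.05],
    [0.1, 0.7, 0.05],
]
uniform = [1.0 / 3] * 3
weights = stacking_weights(dens)
```

On this toy input the weight on the uninformative third chain shrinks toward zero, while the two complementary chains share the remaining mass.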
Bayesian Neural Networks: An Introduction and Survey
Neural Networks (NNs) have provided state-of-the-art results for many challenging machine learning tasks such as detection, regression and classification across the domains of computer vision, speech recognition and natural language processing. Despite their success, they are often implemented in a frequentist scheme, meaning they are unable to reason about uncertainty in their predictions. This article introduces Bayesian Neural Networks (BNNs) and the seminal research regarding their implementation. Different approximate inference methods are compared, and used to highlight where future research can improve on current methods.
Gradient-EM Bayesian Meta-learning
Bayesian meta-learning enables robust and fast adaptation to new tasks with uncertainty assessment. The key idea behind Bayesian meta-learning is empirical Bayes inference of a hierarchical model. In this work, we extend this framework to include a variety of existing methods, before proposing our variant based on the gradient-EM algorithm. Our method improves computational efficiency by avoiding back-propagation computation in the meta-update step, which is exhausting for deep neural networks. Furthermore, it provides flexibility to the inner-update optimization procedure by decoupling it from the meta-update. Experiments on sinusoidal regression, few-shot image classification, and policy-based reinforcement learning show that our method not only achieves better accuracy with less computation cost, but is also more robust to uncertainty.
Calibration of Model Uncertainty for Dropout Variational Inference
Laves, Max-Heinrich, Ihler, Sontje, Kortmann, Karl-Philipp, Ortmaier, Tobias
The model uncertainty obtained by variational Bayesian inference with Monte Carlo dropout is prone to miscalibration. In this paper, different logit scaling methods are extended to dropout variational inference to recalibrate model uncertainty. Expected uncertainty calibration error (UCE) is presented as a metric to measure miscalibration. The effectiveness of recalibration is evaluated on CIFAR-10/100 and SVHN for recent CNN architectures. Experimental results show that logit scaling considerably reduces miscalibration as measured by UCE. Well-calibrated uncertainty enables reliable rejection of uncertain predictions and robust detection of out-of-distribution data.
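A rough sketch of how such a metric and recalibration step might look is given below. The abstract does not define UCE, so the binned form used here (a weighted average over uncertainty bins of the gap between error rate and mean uncertainty, by analogy with expected calibration error) is an assumption, as is the use of normalized entropy as the uncertainty measure. Temperature scaling is shown as one common form of logit scaling; the function names and the bin count are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits divided by a temperature (temperature scaling)."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(z - top) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_entropy(probs):
    """Entropy of a categorical distribution, scaled to lie in [0, 1]."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(len(probs))


def uce(uncertainties, errors, n_bins=10):
    """Binned gap between error rate and mean uncertainty (assumed form of UCE).

    uncertainties: values in [0, 1], one per prediction.
    errors: 1 if the prediction was wrong, 0 if it was correct.
    """
    n = len(uncertainties)
    bins = [[0, 0.0, 0.0] for _ in range(n_bins)]   # count, summed uncertainty, summed error
    for u, e in zip(uncertainties, errors):
        b = min(int(u * n_bins), n_bins - 1)
        bins[b][0] += 1
        bins[b][1] += u
        bins[b][2] += e
    total = 0.0
    for count, u_sum, e_sum in bins:
        if count:
            total += (count / n) * abs(e_sum / count - u_sum / count)
    return total


uce_calibrated = uce([0.5] * 4, [1, 0, 1, 0])       # uncertainty matches the error rate
uce_overconfident = uce([0.0] * 4, [1, 1, 1, 1])    # zero uncertainty, always wrong
```

In a Monte Carlo dropout setting, the probabilities passed to `normalized_entropy` would be the average of several stochastic forward passes, and the temperature would be fitted on held-out data; neither step is shown here.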
Identifying Cognitive Radars -- Inverse Reinforcement Learning using Revealed Preferences
Krishnamurthy, Vikram, Angley, Daniel, Evans, Robin, Moran, William
We consider an inverse reinforcement learning problem involving us versus an enemy radar equipped with a Bayesian tracker. By observing the emissions of the enemy radar, how can we identify if the radar is cognitive (constrained utility maximizer)? Given the observed sequence of actions taken by the enemy's radar, we consider three problems: (i) Are the enemy radar's actions (waveform choice, beam scheduling) consistent with constrained utility maximization? If so, how can we estimate the cognitive radar's utility function that is consistent with its actions? We formulate and solve the problem in terms of the spectra (eigenvalues) of the state and observation noise covariance matrices, and the algebraic Riccati equation. (ii) How to construct a statistical test for detecting a cognitive radar (constrained utility maximization) when we observe the radar's actions in noise or the radar observes our probe signal in noise? We propose a statistical detector with a tight Type-II error bound. (iii) How can we optimally probe (interrogate) the enemy's radar by choosing our state to minimize the Type-II error of detecting if the radar is deploying an economically rational strategy, subject to a constraint on the Type-I detection error? We present a stochastic optimization algorithm to optimize our probe signal. The main analysis framework used in this paper is that of revealed preferences from microeconomics.
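The revealed-preference idea behind problem (i) can be illustrated with the classical consistency check from microeconomics. The sketch below tests the Generalized Axiom of Revealed Preference (GARP) on a sequence of probe vectors and responses in the noise-free case; it is the generic textbook test, not the paper's formulation in terms of covariance spectra and the Riccati equation. Treating probes as price-like vectors and responses as chosen bundles, along with the function and variable names, is an assumption made for illustration.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def satisfies_garp(probes, responses):
    """Return True if the data are consistent with maximizing some utility under a linear budget.

    probes[k] acts as the price-like vector at step k; responses[k] is the chosen action.
    """
    n = len(probes)
    cost = [[dot(probes[i], responses[j]) for j in range(n)] for i in range(n)]

    # Direct revealed preference: response j was affordable when response i was chosen.
    preferred = [[cost[i][i] >= cost[i][j] for j in range(n)] for i in range(n)]

    # Transitive closure of the relation (Warshall's algorithm).
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if preferred[i][k] and preferred[k][j]:
                    preferred[i][j] = True

    # Violation: i is revealed preferred to j, yet i was strictly cheaper when j was chosen.
    for i in range(n):
        for j in range(n):
            if preferred[i][j] and cost[j][j] > cost[j][i]:
                return False
    return True


# Neither response was affordable under the other probe: no inconsistency.
consistent = satisfies_garp([(1, 2), (2, 1)], [(4, 1), (1, 4)])

# Each response was strictly cheaper under the other probe: a preference cycle.
inconsistent = satisfies_garp([(1, 2), (2, 1)], [(1, 4), (4, 1)])
```

Passing this check means some utility function could rationalize the observed actions. It does not identify that function, which is the estimation question the abstract raises next.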
A Bayesian Evaluation Framework for Ground Truth-Free Visual Recognition Tasks
Prijatelj, Derek S., McCurrie, Mel, Scheirer, Walter J.
An interesting development in automatic visual recognition has been the emergence of tasks where it is not possible to assign ground truth labels to images, yet still feasible to collect annotations that reflect human judgements about them. Such tasks include subjective visual attribute assignment and the labeling of ambiguous scenes. Machine learning-based predictors for these tasks rely on supervised training that models the behavior of the annotators, e.g., what would the average person's judgement be for an image? A key open question for this type of work, especially for applications where inconsistency with human behavior can lead to ethical lapses, is how to evaluate the uncertainty of trained predictors. Given that the real answer is unknowable, we are left with often noisy judgements from human annotators to work with. In order to account for the uncertainty that is present, we propose a relative Bayesian framework for evaluating predictors trained on such data. The framework specifies how to estimate a predictor's uncertainty due to the human labels by approximating a conditional distribution and producing a credible interval for the predictions and their measures of performance. The framework is successfully applied to four image classification tasks that use subjective human judgements: facial beauty assessment using the SCUT-FBP5500 dataset, social attribute assignment using data from TestMyBrain.org,