Bayesian Inference
Accurate Uncertainty Estimation and Decomposition in Ensemble Learning
Liu, Jeremiah Zhe, Paisley, John, Kioumourtzoglou, Marianthi-Anna, Coull, Brent
Ensemble learning is a standard approach to building machine learning systems that capture complex phenomena in real-world data. An important aspect of these systems is the complete and valid quantification of model uncertainty. We introduce a Bayesian nonparametric ensemble (BNE) approach that augments an existing ensemble model to account for different sources of model uncertainty. BNE augments a model's prediction and distribution functions using Bayesian nonparametric machinery. It has a theoretical guarantee in that it robustly estimates the uncertainty patterns in the data distribution, and can decompose its overall predictive uncertainty into distinct components that are due to different sources of noise and error. We show that our method achieves accurate uncertainty estimates under complex observational noise, and illustrate its real-world utility in terms of uncertainty decomposition and model bias detection for an ensemble in predict air pollution exposures in Eastern Massachusetts, USA.
Alleviating Label Switching with Optimal Transport
Monteiller, Pierre, Claici, Sebastian, Chien, Edward, Mirzazadeh, Farzaneh, Solomon, Justin, Yurochkin, Mikhail
Label switching is a phenomenon arising in mixture model posterior inference that prevents one from meaningfully assessing posterior statistics using standard Monte Carlo procedures. This issue arises due to invariance of the posterior under actions of a group; for example, permuting the ordering of mixture components has no effect on the likelihood. We propose a resolution to label switching that leverages machinery from optimal transport. Our algorithm efficiently computes posterior statistics in the quotient space of the symmetry group. We give conditions under which there is a meaningful solution to label switching and demonstrate advantages over alternative approaches on simulated and real data.
Protecting from Malware Obfuscation Attacks through Adversarial Risk Analysis
Redondo, Alberto, Insua, David Rios
Standard algorithms in detection systems perform insufficiently when dealing with malware passed through obfuscation tools. We illustrate this studying in detail an open source metamorphic software, making use of a hybrid framework to obtain the relevant features from binaries. We then provide an improved alternative solution based on adversarial risk analysis which we illustrate describe with an example. KEYWORDS: Adversarial Risk Analysis, Malware Obfuscation, Cybersecurity 1 INTRODUCTION The digital era is bringing along new global threats among which cybersecurity related ones emerge as truly worrisome, see for example the evolution of the Global Risks Map from the World Economic Forum (2017, 2018, 2019). Indeed, the operation of critical cyber infrastructures relies on components which could be cyber attacked, both incidentally and intentionally, suffering major performance degradation, Rao et al. (2016).
Not All Claims are Created Equal: Choosing the Right Approach to Assess Your Hypotheses
Azer, Erfan Sadeqi, Khashabi, Daniel, Sabharwal, Ashish, Roth, Dan
Empirical research in Natural Language Processing (NLP) has adopted a narrow set of principles for assessing hypotheses, relying mainly on p-value computation, which suffers from several known issues. While alternative proposals have been well-debated and adopted in other fields, they remain rarely discussed or used within the NLP community. We address this gap by contrasting various hypothesis assessment techniques, especially those not commonly used in the field (such as evaluations based on Bayesian inference). Since these statistical techniques differ in the hypotheses they can support, we argue that practitioners should first decide their target hypothesis before choosing an assessment method. This is crucial because common fallacies, misconceptions, and misinterpretation surrounding hypothesis assessment methods often stem from a discrepancy between what one would like to claim versus what the method used actually assesses. Our survey reveals that these issues are omnipresent in the NLP research community. As a step forward, we provide best practices and guidelines tailored to NLP research, as well as an easy-to-use package called 'HyBayes' for Bayesian assessment of hypotheses, complementing existing tools.
Maximum a-Posteriori Estimation for the Gaussian Mixture Model via Mixed Integer Nonlinear Programming
Flaherty, Patrick, Wiratchotisatian, Pitchaya, Lee, Ji Ah, Trapp, Andrew C.
We present a global optimization approach for solving the classical maximum a-posteriori (MAP) estimation problem for the Gaussian mixture model. Our approach formulates the MAP estimation problem as a mixed-integer nonlinear optimization problem (MINLP). Our method provides a certificate of global optimality, can accommodate side constraints, and is extendable to other finite mixture models. We propose an approximation to the MINLP hat transforms it into a mixed integer quadratic program (MIQP) which preserves global optimality within desired accuracy and improves computational aspects. Numerical experiments compare our method to standard estimation approaches and show that our method finds the globally optimal MAP for some standard data sets, providing a benchmark for comparing estimation methods.
Confidence Intervals for Policy Evaluation in Adaptive Experiments
Hadad, Vitor, Hirshberg, David A., Zhan, Ruohan, Wager, Stefan, Athey, Susan
Adaptive experiments can result in considerable cost savings in multi-armed trials by enabling analysts to quickly focus on the most promising alternatives. Most existing work on adaptive experiments (which include multi-armed bandits) has focused maximizing the speed at which the analyst can identify the optimal arm and/or minimizing the number of draws from sub-optimal arms. In many scientific settings, however, it is not only of interest to identify the optimal arm, but also to perform a statistical analysis of the data collected from the experiment. Naive approaches to statistical inference with adaptive inference fail because many commonly used statistics (such as sample means or inverse propensity weighting) do not have an asymptotically Gaussian limiting distribution centered on the estimate, and so confidence intervals constructed from these statistics do not have correct coverage. But, as shown in this paper, carefully designed data-adaptive weighting schemes can be used to overcome this issue and restore a relevant central limit theorem, enabling hypothesis testing. We validate the accuracy of the resulting confidence intervals in numerical experiments.
Uncertainty relations and fluctuation theorems for Bayes nets
The pioneering paper [Ito and Sagawa, 2013] analyzed the non-equilibrium statistical physics of a set of multiple interacting systems, S, whose joint discrete-time evolution is specified by a Bayesian network. The major result of [Ito and Sagawa, 2013] was an integral fluctuation theorem (IFT) governing the sum of two quantities: the entropy production (EP) of an arbitrary single v in S, and the transfer entropy from v to the other systems. Here I extend the analysis in [Ito and Sagawa, 2013]. I derive several detailed fluctuation theorems (DFTs), concerning arbitrary subsets of all the systems (including the full set). I also derive several associated IFTs, concerning an arbitrary subset of the systems, thereby extending the IFT in [Ito and Sagawa, 2013]. In addition I derive "conditional" DFTs and IFTs, involving conditional probability distributions rather than (as in conventional fluctuation theorems) unconditioned distributions. I then derive thermodynamic uncertainty relations relating the total EP of the Bayes net to the set of all the precisions of probability currents within the individual systems. I end with an example of that uncertainty relation.
Probabilistic Similarity Networks
Normative expert systems have not become commonplace because they have been difficult to build and use. Over the past decade, however, researchers have developed the influence diagram, a graphical representation of a decision maker's beliefs, alternatives, and preferences that serves as the knowledge base of a normative expert system. Most people who have seen the representation find it intuitive and easy to use. Consequently, the influence diagram has overcome significantly the barriers to constructing normative expert systems. Nevertheless, building influence diagrams is not practical for extremely large and complex domains. In this book, I address the difficulties associated with the construction of the probabilistic portion of an influence diagram, called a knowledge map, belief network, or Bayesian network. I introduce two representations that facilitate the generation of large knowledge maps. In particular, I introduce the similarity network, a tool for building the network structure of a knowledge map, and the partition, a tool for assessing the probabilities associated with a knowledge map. I then use these representations to build Pathfinder, a large normative expert system for the diagnosis of lymph-node diseases (the domain contains over 60 diseases and over 100 disease findings). In an early version of the system, I encoded the knowledge of the expert using an erroneous assumption that all disease findings were independent, given each disease. When the expert and I attempted to build a more accurate knowledge map for the domain that would capture the dependencies among the disease findings, we failed. Using a similarity network, however, we built the knowledge-map structure for the entire domain in approximately 40 hours. Furthermore, the partition representation reduced the number of probability assessments required by the expert from 75,000 to 14,000.
Computational Separations between Sampling and Optimization
Two commonly arising computational tasks in Bayesian learning are Optimization (Maximum A Posteriori estimation) and Sampling (from the posterior distribution). In the convex case these two problems are efficiently reducible to each other. Recent work (Ma et al. 2019) shows that in the non-convex case, sampling can sometimes be provably faster. We present a simpler and stronger separation. We then compare sampling and optimization in more detail and show that they are provably incomparable: there are families of continuous functions for which optimization is easy but sampling is NP-hard, and vice versa. Further, we show function families that exhibit a sharp phase transition in the computational complexity of sampling, as one varies the natural temperature parameter. Our results draw on a connection to analogous separations in the discrete setting which are well-studied.
A Method to Model Conditional Distributions with Normalizing Flows
Xiao, Zhisheng, Yan, Qing, Amit, Yali
In this work, we investigate the use of normalizing flows to model conditional distributions. In particular, we use our proposed method to analyze inverse problems with invertible neural networks by maximizing the posterior likelihood. Our method uses only a single loss and is easy to train. This is an improvement on the previous method that solves similar inverse problems with invertible neural networks but which involves a combination of several loss terms with ad-hoc weighting. In addition, our method provides a natural framework to incorporate conditioning in normalizing flows, and therefore, we can train an invertible network to perform conditional generation. We analyze our method and perform a careful comparison with previous approaches. Simple experiments show the effectiveness of our method, and more comprehensive experimental evaluations are undergoing.