Uncertainty
Delayed acceptance ABC-SMC
Everitt, Richard G., Rowińska, Paulina A.
Approximate Bayesian computation (ABC) is now an established technique for statistical inference used in cases where the likelihood function is computationally expensive or not available. It relies on the use of a model that is specified in the form of a simulator, and approximates the likelihood at a parameter $\theta$ by simulating auxiliary data sets $x$ and evaluating the distance of $x$ from the true data $y$. However, ABC is not computationally feasible in cases where using the simulator for each $\theta$ is very expensive. This paper investigates this situation in cases where a cheap, but approximate, simulator is available. The approach is to employ delayed acceptance Markov chain Monte Carlo (MCMC) within an ABC sequential Monte Carlo (SMC) sampler in order to, in a first stage of the kernel, use the cheap simulator to rule out parts of the parameter space that are not worth exploring, so that the "true" simulator is only run (in the second stage of the kernel) where there is a reasonable chance of accepting proposed values of $\theta$. We show that this approach can be used quite automatically, with the only tuning parameter choice additional to ABC-SMC being the number of particles we wish to carry through to the second stage of the kernel. Applications to stochastic differential equation models and latent doubly intractable distributions are presented.
A network approach to topic models
Gerlach, Martin, Peixoto, Tiago P., Altmann, Eduardo G.
One of the main computational and scientific challenges in the modern age is to extract useful information from unstructured texts. Topic models are one popular machine-learning approach which infers the latent topical structure of a collection of documents. Despite their success --- in particular of its most widely used variant called Latent Dirichlet Allocation (LDA) --- and numerous applications in sociology, history, and linguistics, topic models are known to suffer from severe conceptual and practical problems, e.g. a lack of justification for the Bayesian priors, discrepancies with statistical properties of real texts, and the inability to properly choose the number of topics. Here, we approach the problem of identifying topical structures by representing text corpora as bipartite networks of documents and words and using methods from community detection in complex networks, in particular stochastic block models (SBM). We show that our SBM-based approach constitutes a more principled and versatile framework for topic modeling solving the intrinsic limitations of Dirichlet-based models through a more general choice of nonparametric priors. It automatically detects the number of topics and hierarchically clusters both the words and documents. In practice, we demonstrate through the analysis of artificial and real corpora that our approach outperforms LDA in terms of statistical model selection.
Learning Model Reparametrizations: Implicit Variational Inference by Fitting MCMC distributions
Consider a probabilistic model with joint distribution p(x, z) where x are data and z are latent variables and/or random parameters. Suppose that exact inference in p(x, z) is intractable which means that the posterior distribution p(z x) p(x, z) p(x, z)dz, is difficult to compute due to the normalizing constant p(x) p(x, z)dz that represents the probability of the data and it is known as evidence or marginal likelihood. The marginal likelihood is essential for estimation of any extra parameters in p(x) or for model comparison. Approximate inference algorithms target to approximate p(z x) and/or p(x). Two general frameworks, that we briefly review next, are based on Markov chain Monte Carlo (MCMC) [33, 2] and variational inference (VI) [17, 40].
A Latent Variable Model for Two-Dimensional Canonical Correlation Analysis and its Variational Inference
Safayani, Mehran, Momenzadeh, Saeid
Describing the dimension reduction (DR) techniques by means of probabilistic models has recently been given special attention. Probabilistic models, in addition to a better interpretability of the DR methods, provide a framework for further extensions of such algorithms. One of the new approaches to the probabilistic DR methods is to preserving the internal structure of data. It is meant that it is not necessary that the data first be converted from the matrix or tensor format to the vector format in the process of dimensionality reduction. In this paper, a latent variable model for matrix-variate data for canonical correlation analysis (CCA) is proposed. Since in general there is not any analytical maximum likelihood solution for this model, we present two approaches for learning the parameters. The proposed methods are evaluated using the synthetic data in terms of convergence and quality of mappings. Also, real data set is employed for assessing the proposed methods with several probabilistic and none-probabilistic CCA based approaches. The results confirm the superiority of the proposed methods with respect to the competing algorithms. Moreover, this model can be considered as a framework for further extensions.
Learning Approximately Objective Priors
Nalisnick, Eric, Smyth, Padhraic
Informative Bayesian priors are often difficult to elicit, and when this is the case, modelers usually turn to noninformative or objective priors. However, objective priors such as the Jeffreys and reference priors are not tractable to derive for many models of interest. We address this issue by proposing techniques for learning reference prior approximations: we select a parametric family and optimize a black-box lower bound on the reference prior objective to find the member of the family that serves as a good approximation. We experimentally demonstrate the method's effectiveness by recovering Jeffreys priors and learning the Variational Autoencoder's reference prior.
A Bayesian Approach to Policy Recognition and State Representation Learning
Šošić, Adrian, Zoubir, Abdelhak M., Koeppl, Heinz
Learning from demonstration (LfD) is the process of building behavioral models of a task from demonstrations provided by an expert. These models can be used e.g. for system control by generalizing the expert demonstrations to previously unencountered situations. Most LfD methods, however, make strong assumptions about the expert behavior, e.g. they assume the existence of a deterministic optimal ground truth policy or require direct monitoring of the expert's controls, which limits their practical use as part of a general system identification framework. In this work, we consider the LfD problem in a more general setting where we allow for arbitrary stochastic expert policies, without reasoning about the optimality of the demonstrations. Following a Bayesian methodology, we model the full posterior distribution of possible expert controllers that explain the provided demonstration data. Moreover, we show that our methodology can be applied in a nonparametric context to infer the complexity of the state representation used by the expert, and to learn task-appropriate partitionings of the system state space.
A glass-box interactive machine learning approach for solving NP-hard problems with the human-in-the-loop
Holzinger, Andreas, Plass, Markus, Holzinger, Katharina, Crisan, Gloria Cerasela, Pintea, Camelia-M., Palade, Vasile
The goal of Machine Learning to automatically learn from data, extract knowledge and to make decisions without any human intervention. Such automatic (aML) approaches show impressive success. Recent results even demonstrate intriguingly that deep learning applied for automatic classification of skin lesions is on par with the performance of dermatologists, yet outperforms the average. As human perception is inherently limited, such approaches can discover patterns, e.g. that two objects are similar, in arbitrarily high-dimensional spaces what no human is able to do. Humans can deal only with limited amounts of data, whilst big data is beneficial for aML; however, in health informatics, we are often confronted with a small number of data sets, where aML suffer of insufficient training samples and many problems are computationally hard. Here, interactive machine learning (iML) may be of help, where a human-in-the-loop contributes to reduce the complexity of NP-hard problems. A further motivation for iML is that standard black-box approaches lack transparency, hence do not foster trust and acceptance of ML among end-users. Rising legal and privacy aspects, e.g. with the new European General Data Protection Regulations, make black-box approaches difficult to use, because they often are not able to explain why a decision has been made. In this paper, we present some experiments to demonstrate the effectiveness of the human-in-the-loop approach, particularly in opening the black-box to a glass-box and thus enabling a human directly to interact with an learning algorithm. We selected the Ant Colony Optimization framework, and applied it on the Traveling Salesman Problem, which is a good example, due to its relevance for health informatics, e.g. for the study of protein folding. From studies of how humans extract so much from so little data, fundamental ML-research also may benefit.
Learning to Discover Sparse Graphical Models
Belilovsky, Eugene, Kastner, Kyle, Varoquaux, Gaël, Blaschko, Matthew
We consider structure discovery of undirected graphical models from observational data. Inferring likely structures from few examples is a complex task often requiring the formulation of priors and sophisticated inference procedures. Popular methods rely on estimating a penalized maximum likelihood of the precision matrix. However, in these approaches structure recovery is an indirect consequence of the data-fit term, the penalty can be difficult to adapt for domain-specific knowledge, and the inference is computationally demanding. By contrast, it may be easier to generate training samples of data that arise from graphs with the desired structure properties. We propose here to leverage this latter source of information as training data to learn a function, parametrized by a neural network that maps empirical covariance matrices to estimated graph structures. Learning this function brings two benefits: it implicitly models the desired structure or sparsity properties to form suitable priors, and it can be tailored to the specific problem of edge structure discovery, rather than maximizing data likelihood. Applying this framework, we find our learnable graph-discovery method trained on synthetic data generalizes well: identifying relevant edges in both synthetic and real data, completely unknown at training time. We find that on genetics, brain imaging, and simulation data we obtain performance generally superior to analytical methods.
Optimal Belief Approximation
Leike, Reimar H., Enßlin, Torsten A.
In Bayesian statistics, probabilities are interpreted as degrees of belief. For any set of mutually exclusive and exhaustive events, one expresses the state of knowledge as a probability distribution over that set. The probability of an event then describes the personal confidence that this event will happen or has happened. As a consequence, probabilities are subjective properties reflecting the amount of knowledge an observer has about the events; a different observer might know which event happened and assign different probabilities. If an observer gains information, she updates the probabilities she had assigned before. If the set of possible mutually exclusive and exhaustive events is infinite, it is generally impossible to store all entries of the corresponding probability distribution on a computer or communicate it through a channel with finite bandwidth. One therefore needs to approximate the probability distribution which describes one's belief. Given a limited set X of approximative beliefs q(s) on a quantity s, what is the best belief to approximate the actual belief as expressed by the probability p(s)? In the literature, it is sometimes claimed that the best approximation is given by the q X that minimizes the Kullback-Leibler divergence ("approximation" KL) [1] KL(p, q) () p(s) p(s) ln (1) q(s)
Hamiltonian Monte Carlo with Energy Conserving Subsampling
Dang, Khue-Dung, Quiroz, Matias, Kohn, Robert, Tran, Minh-Ngoc, Villani, Mattias
Hamiltonian Monte Carlo (HMC) has recently received considerable attention in the literature due to its ability to overcome the slow exploration of the parameter space inherent in random walk proposals. In tandem, data subsampling has been extensively used to overcome the computational bottlenecks in posterior sampling algorithms that require evaluating the likelihood over the whole data set, or its gradient. However, while data subsampling has been successful in traditional MCMC algorithms such as Metropolis-Hastings, it has been demonstrated to be unsuccessful in the context of HMC, both in terms of poor sampling efficiency and in producing highly biased inferences. We propose an efficient HMC-within-Gibbs algorithm that utilizes data subsampling to speed up computations and simulates from a slightly perturbed target, which is within $O(m^{-2})$ of the true target, where $m$ is the size of the subsample. We also show how to modify the method to obtain exact inference on any function of the parameters. Contrary to previous unsuccessful approaches, we perform subsampling in a way that conserves energy but for a modified Hamiltonian. We can therefore maintain high acceptance rates even for distant proposals. We apply the method for simulating from the posterior distribution of a high-dimensional spline model for bankruptcy data and document speed ups of several orders of magnitude compare to standard HMC and, moreover, demonstrate a negligible bias.