Bayesian Inference
Unsupervised Learning of Noisy-Or Bayesian Networks
Halpern, Yonatan, Sontag, David
This paper considers the problem of learning the parameters in Bayesian networks of discrete variables with known structure and hidden variables. Previous approaches in these settings typically use expectation maximization; when the network has high treewidth, the required expectations might be approximated using Monte Carlo or variational methods. We show how to avoid inference altogether during learning by giving a polynomial-time algorithm based on the method-of-moments, building upon recent work on learning discrete-valued mixture models. In particular, we show how to learn the parameters for a family of bipartite noisy-or Bayesian networks. In our experimental results, we demonstrate an application of our algorithm to learning QMR-DT, a large Bayesian network used for medical diagnosis. We show that it is possible to fully learn the parameters of QMR-DT even when only the findings are observed in the training data (ground truth diseases unknown).
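As an illustration of the noisy-or parameterization mentioned above, the following minimal Python sketch computes the probability that a finding is present given which diseases are active. The function name, argument names, and numeric values are assumptions made for this example. It shows only the standard noisy-or conditional distribution, not the method-of-moments learning algorithm from the paper.

```python
def noisy_or_prob(parents_on, failure_probs, leak_failure=1.0):
    """Return P(finding = 1 | parents) under a standard noisy-or model.

    parents_on    -- sequence of 0/1 flags, one per parent disease
    failure_probs -- failure_probs[i] is the probability that active parent i
                     fails to turn the finding on
    leak_failure  -- probability that the leak (background cause) fails to
                     turn the finding on; 1.0 means no leak
    """
    if len(parents_on) != len(failure_probs):
        raise ValueError("need one failure probability per parent")
    p_off = leak_failure
    for on, fail in zip(parents_on, failure_probs):
        if on:
            # Each active parent independently fails with probability `fail`.
            p_off *= fail
    return 1.0 - p_off


# Hypothetical numbers: two diseases and a small leak.
p_both = noisy_or_prob([1, 1], [0.2, 0.5], leak_failure=0.9)  # 1 - 0.9 * 0.2 * 0.5
p_none = noisy_or_prob([0, 0], [0.2, 0.5], leak_failure=0.9)  # 1 - 0.9
print(round(p_both, 4), round(p_none, 4))
```

The finding stays off only if every active cause, including the leak, independently fails, which is why the parameters multiply.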
SparsityBoost: A New Scoring Function for Learning Bayesian Network Structure
We give a new consistent scoring function for structure learning of Bayesian networks. In contrast to traditional approaches to score-based structure learning, such as BDeu or MDL, the complexity penalty that we propose is data-dependent and is given by the probability that a conditional independence test correctly shows that an edge cannot exist. What really distinguishes this new scoring function from earlier work is that it has the property of becoming computationally easier to maximize as the amount of data increases. We prove a polynomial sample complexity result, showing that maximizing this score is guaranteed to correctly learn a structure with no false edges and a distribution close to the generating distribution, whenever there exists a Bayesian network which is a perfect map for the data generating distribution. Although the new score can be used with any search algorithm, we give empirical results showing that it is particularly effective when used together with a linear programming relaxation approach to Bayesian network structure learning.
Measure Transformer Semantics for Bayesian Machine Learning
Borgström, Johannes, Gordon, Andrew D, Greenberg, Michael, Margetson, James, Van Gael, Jurgen
The Bayesian approach to machine learning amounts to computing posterior distributions of random variables from a probabilistic model of how the variables are related (that is, a prior distribution) and a set of observations of variables. There is a trend in machine learning towards expressing Bayesian models as probabilistic programs. As a foundation for this kind of programming, we propose a core functional calculus with primitives for sampling prior distributions and observing variables. We define measure-transformer combinators inspired by theorems in measure theory, and use these to give a rigorous semantics to our core calculus. The original features of our semantics include its support for discrete, continuous, and hybrid measures, and, in particular, for observations of zero-probability events. We compile our core language to a small imperative language that is processed by an existing inference engine for factor graphs, which are data structures that enable many efficient inference algorithms. This allows efficient approximate inference of posterior marginal distributions, treating thousands of observations per second for large instances of realistic models.
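The sample-then-observe reading of a probabilistic program can be illustrated on a finite discrete model, where a measure is just a table of weights. The Python sketch below is a toy under stated assumptions: the helper names `product_measure` and `observe` are hypothetical and are not the paper's combinators. It builds a prior over two fair coins, conditions on an observation, and renormalizes. Unlike the semantics described above, it does not handle continuous measures or zero-probability observations.

```python
from fractions import Fraction


def product_measure(m1, m2):
    """Joint measure of two independent finite distributions (dict: outcome -> weight)."""
    return {(a, b): p * q for a, p in m1.items() for b, q in m2.items()}


def observe(measure, predicate):
    """Restrict a finite measure to outcomes satisfying `predicate`, then renormalize."""
    restricted = {o: w for o, w in measure.items() if predicate(o)}
    total = sum(restricted.values())
    if total == 0:
        raise ValueError("observation has probability zero under this measure")
    return {o: w / total for o, w in restricted.items()}


coin = {"H": Fraction(1, 2), "T": Fraction(1, 2)}
prior = product_measure(coin, coin)              # sample two independent coins
posterior = observe(prior, lambda o: "H" in o)   # observe: at least one head
p_both_heads = posterior[("H", "H")]
print(p_both_heads)
```

Exact fractions are used so that the normalized posterior can be checked without rounding concerns.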
Latent Fisher Discriminant Analysis
Linear Discriminant Analysis (LDA) is a well-known method for dimensionality reduction and classification. Previous studies have also extended the binary-class case to multiple classes. However, many applications, such as object detection and keyframe extraction, cannot provide consistent instance-label pairs, while LDA requires instance-level labels for training. Thus it cannot be directly applied to semi-supervised classification problems. In this paper, we overcome this limitation and propose a latent variable Fisher discriminant analysis model. We relax the instance-level labeling to bag-level labeling, which is a kind of semi-supervision (video-level labels of event type are required for semantic frame extraction), and incorporate a data-driven prior over the latent variables. Hence, our method combines latent variable inference and dimension reduction in a unified Bayesian framework. We test our method on the MUSK and Corel data sets and obtain competitive results compared to the baseline approach. We also demonstrate its capacity on the challenging TRECVID MED11 dataset for semantic keyframe extraction and conduct a human-factors ranking-based experimental evaluation, which clearly demonstrates that our proposed method consistently extracts more semantically meaningful keyframes than challenging baselines.
Integrated Pre-Processing for Bayesian Nonlinear System Identification with Gaussian Processes
Frigola, Roger, Rasmussen, Carl Edward
We introduce GP-FNARX: a new model for nonlinear system identification based on a nonlinear autoregressive exogenous model (NARX) with filtered regressors (F) where the nonlinear regression problem is tackled using sparse Gaussian processes (GP). We integrate data pre-processing with system identification into a fully automated procedure that goes from raw data to an identified model. Both pre-processing parameters and GP hyper-parameters are tuned by maximizing the marginal likelihood of the probabilistic model. We obtain a Bayesian model of the system's dynamics which is able to report its uncertainty in regions where the data is scarce. The automated approach, the modeling of uncertainty, and its relatively low computational cost make GP-FNARX a good candidate for applications in robotics and adaptive control.
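To make the NARX structure concrete, the sketch below builds the lagged regressor vectors that such a model feeds to its regression function: each row collects past outputs and past inputs, and the target is the current output. The function name `narx_regressors`, the lag orders, and the data are assumptions for illustration. The filtering step and the sparse Gaussian process regression described above are deliberately left out.

```python
def narx_regressors(y, u, n_y, n_u):
    """Build NARX regressors from output signal y and input signal u.

    Row for time t is [y[t-1], ..., y[t-n_y], u[t-1], ..., u[t-n_u]],
    paired with the target y[t].
    """
    if len(y) != len(u):
        raise ValueError("y and u must have the same length")
    start = max(n_y, n_u)          # first time step with all lags available
    rows, targets = [], []
    for t in range(start, len(y)):
        past_y = [y[t - k] for k in range(1, n_y + 1)]
        past_u = [u[t - k] for k in range(1, n_u + 1)]
        rows.append(past_y + past_u)
        targets.append(y[t])
    return rows, targets


# Hypothetical signals, chosen only so the lag pattern is easy to read.
y = [0, 1, 2, 3, 4]
u = [10, 11, 12, 13, 14]
rows, targets = narx_regressors(y, u, n_y=2, n_u=1)
for row, target in zip(rows, targets):
    print(row, "->", target)
```

Any regression method can then be fitted on `rows` and `targets`; the abstract's approach uses a sparse GP and tunes the pre-processing jointly with the GP hyper-parameters.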
Exponentially Fast Parameter Estimation in Networks Using Distributed Dual Averaging
Shahrampour, Shahin, Jadbabaie, Ali
In this paper we present an optimization-based view of distributed parameter estimation and observational social learning in networks. Agents receive a sequence of random, independent and identically distributed (i.i.d.) signals, each of which individually may not be informative about the underlying true state, but the signals together are globally informative enough to make the true state identifiable. Using an optimization-based characterization of Bayesian learning as proximal stochastic gradient descent (with Kullback-Leibler divergence from a prior as a proximal function), we show how to efficiently use a distributed, online variant of Nesterov's dual averaging method to solve the estimation with purely local information. When the true state is globally identifiable, and the network is connected, we prove that agents eventually learn the true parameter using a randomized gossip scheme. We demonstrate that with high probability the convergence is exponentially fast with a rate dependent on the KL divergence of observations under the true state from observations under the second likeliest state. Furthermore, our work also highlights the possibility of learning under continuous adaptation of the network, which is a consequence of employing a constant, unit stepsize for the algorithm.
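The exponential concentration of belief on the true state can be seen even in a single-agent setting. The sketch below is only that: one agent applying Bayes' rule to binary signals over a finite set of candidate states. It is not the distributed dual averaging or gossip scheme from the paper, and the state likelihoods and the fixed signal sequence are hypothetical.

```python
def bayes_update(belief, likelihoods, signal):
    """One Bayesian update of a belief over finitely many candidate states.

    belief      -- list of current probabilities, one per state
    likelihoods -- likelihoods[k] is P(signal = 1 | state k) for a binary signal
    signal      -- observed value, 0 or 1
    """
    weights = [b * (p if signal == 1 else 1.0 - p)
               for b, p in zip(belief, likelihoods)]
    total = sum(weights)
    return [w / total for w in weights]


likelihoods = [0.3, 0.5, 0.7]        # hypothetical states; index 2 plays the true state
belief = [1 / 3, 1 / 3, 1 / 3]       # uniform prior
signals = ([1] * 7 + [0] * 3) * 20   # fixed sequence with 70% ones, for reproducibility
for s in signals:
    belief = bayes_update(belief, likelihoods, s)
print([round(b, 6) for b in belief])
```

In this toy run the log-ratio between a wrong state and the true state falls roughly linearly in the number of signals, so the wrong states' beliefs decay exponentially, and the closest competitor (here the state with likelihood 0.5) decays slowest.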
Efficient Monte Carlo Methods for Multi-Dimensional Learning with Classifier Chains
Read, Jesse, Martino, Luca, Luengo, David
Multidimensional classification (MDC) is the supervised learning problem where an instance is associated with multiple classes, rather than with a single class, as in traditional classification problems. Since these classes are often strongly correlated, modeling the dependencies between them allows MDC methods to improve their performance, at the expense of an increased computational cost. In this paper we focus on the classifier chains (CC) approach for modeling dependencies, one of the most popular and highest-performing methods for multi-label classification (MLC), a particular case of MDC which involves only binary classes (i.e., labels). The original CC algorithm makes a greedy approximation, and is fast but tends to propagate errors along the chain. Our algorithms remain tractable for high-dimensional data sets and obtain the best predictive performance across several real data sets.
Keywords: classifier chains, multidimensional classification, multi-label classification, Monte Carlo methods, Bayesian inference
Preprint submitted to Pattern Recognition March 22, 2018
1. Introduction
Multidimensional classification (MDC) is the supervised learning problem where an instance may be associated with multiple classes, rather than with a single class as in traditional binary or multi-class single-dimensional classification (SDC) problems. So-called MDC (e.g., in [1]) is also known in the literature as multi-target, multi-output [2], or multi-objective [3] classification. The recently popularised task of multi-label classification (see [4, 5, 6, 7] for overviews) can be viewed as a particular case of the multidimensional problem that only involves binary classes, i.e., labels that can be turned on (1) or off (0) for any data instance.
The MDC learning context is receiving increased attention in the literature, since it arises naturally in a wide variety of domains, such as image classification [8, 9], information retrieval and text categorization [10], automated detection of emotions in music [11] or bioinformatics [10, 12].
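The remark that greedy classifier-chain inference can propagate errors along the chain can be illustrated with a two-label toy. In the sketch below, the conditional probabilities are hypothetical hand-written functions standing in for trained classifiers, and exhaustive search over label vectors is used as a simple reference point. It is not the Monte Carlo approach of the paper.

```python
from itertools import product


def greedy_chain(cond_probs):
    """Greedy chain inference: fix each label to its locally most likely value.

    cond_probs[j](prev) returns P(y_j = 1 | x, earlier labels = prev) for a fixed input x.
    """
    labels = []
    for cond in cond_probs:
        p_one = cond(tuple(labels))
        labels.append(1 if p_one > 0.5 else 0)
    return tuple(labels)


def joint_prob(cond_probs, labels):
    """Probability of a full label vector under the chain factorization."""
    prob = 1.0
    for j, cond in enumerate(cond_probs):
        p_one = cond(tuple(labels[:j]))
        prob *= p_one if labels[j] == 1 else 1.0 - p_one
    return prob


def exhaustive_map(cond_probs):
    """Score every label vector; cost is exponential in the number of labels."""
    candidates = product([0, 1], repeat=len(cond_probs))
    return max(candidates, key=lambda y: joint_prob(cond_probs, y))


# Hypothetical conditionals for one input x with two labels.
chain = [
    lambda prev: 0.4,                              # P(y1 = 1 | x)
    lambda prev: 0.95 if prev[0] == 1 else 0.4,    # P(y2 = 1 | x, y1)
]
greedy = greedy_chain(chain)
best = exhaustive_map(chain)
print(greedy, joint_prob(chain, greedy))
print(best, joint_prob(chain, best))
```

Here the greedy pass commits to `y1 = 0` and ends at a label vector with joint probability 0.36, while the vector `(1, 1)` has joint probability 0.38. An early local decision has fixed the rest of the chain on a suboptimal path.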
Variational Bayes Approximations for Clustering via Mixtures of Normal Inverse Gaussian Distributions
Subedi, Sanjeena, McNicholas, Paul D.
The use of mixture models for clustering, referred to as model-based clustering, has become increasingly popular since the work of Wolfe (1963). A wide variety of finite mixture models has been studied extensively within the literature to date. Amongst these, the Gaussian mixture model has received special attention due to its mathematical tractability and the relative computational simplicity associated with parameter estimation. However, the Gaussian mixture model is not without limitations; for instance, the component densities are restricted to being symmetric.
BayesOpt: A Library for Bayesian optimization with Robotics Applications
The purpose of this paper is twofold. On the one hand, we present a general framework for Bayesian optimization and we compare it with some related fields in active learning and Bayesian numerical analysis. On the other hand, Bayesian optimization and related problems (bandits, sequential experimental design) are highly dependent on the surrogate model that is selected. However, there is no clear standard in the literature. Thus, we present a fast and flexible toolbox that allows users to test and combine different models and criteria with little effort. It includes most of the state-of-the-art contributions, algorithms and models. Its speed also removes part of the stigma that Bayesian optimization methods are only good for "expensive functions". The software is free and can be used with many operating systems and programming languages.
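To show how a surrogate model and a selection criterion interact in Bayesian optimization, the sketch below implements expected improvement, one widely used criterion, for a Gaussian predictive distribution. The abstract does not say which criteria the toolbox provides, and this is not the BayesOpt API: the function name, parameters, and candidate values are assumptions for illustration.

```python
import math


def expected_improvement(mu, sigma, best, xi=0.0):
    """Expected improvement for maximization under a Gaussian predictive distribution.

    mu, sigma -- surrogate posterior mean and standard deviation at a candidate point
    best      -- best objective value observed so far
    xi        -- optional exploration margin (0 gives the plain criterion)
    """
    gain = mu - best - xi
    if sigma <= 0.0:
        # No predictive uncertainty: improvement is deterministic.
        return max(gain, 0.0)
    z = gain / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return gain * cdf + sigma * pdf


# Hypothetical candidates: (mean, std) pairs from some surrogate model.
candidates = {"a": (1.0, 0.1), "b": (0.9, 0.5), "c": (1.1, 0.0)}
best_so_far = 1.0
scores = {name: expected_improvement(m, s, best_so_far)
          for name, (m, s) in candidates.items()}
next_point = max(scores, key=scores.get)
print(next_point, scores)
```

In this toy example the criterion prefers candidate `b`, whose mean is below the current best but whose uncertainty is large. This is why the choice of surrogate model, which supplies `mu` and `sigma`, matters so much for the resulting search.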
Scalable Probabilistic Entity-Topic Modeling
Houlsby, Neil, Ciaramita, Massimiliano
We present an LDA approach to entity disambiguation. Each topic is associated with a Wikipedia article and topics generate either content words or entity mentions. Training such models is challenging because of the topic and vocabulary size, both in the millions. We tackle these problems using a novel distributed inference and representation framework based on a parallel Gibbs sampler guided by the Wikipedia link graph, and pipelines of MapReduce allowing fast and memory-frugal processing of large datasets. We report state-of-the-art performance on a public dataset.