Goto

Collaborating Authors

 Bayesian Learning


Bayesian inference for spatio-temporal spike-and-slab priors

arXiv.org Machine Learning

In this work, we address the problem of solving a series of underdetermined linear inverse problemblems subject to a sparsity constraint. We generalize the spike-and-slab prior distribution to encode a priori correlation of the support of the solution in both space and time by imposing a transformed Gaussian process on the spike-and-slab probabilities. An expectation propagation (EP) algorithm for posterior inference under the proposed model is derived. For large scale problems, the standard EP algorithm can be prohibitively slow. We therefore introduce three different approximation schemes to reduce the computational complexity. Finally, we demonstrate the proposed model using numerical experiments based on both synthetic and real data sets.


Uncertainty Estimates for Efficient Neural Network-based Dialogue Policy Optimisation

arXiv.org Machine Learning

In statistical dialogue management, the dialogue manager learns a policy that maps a belief state to an action for the system to perform. Efficient exploration is key to successful policy optimisation. Current deep reinforcement learning methods are very promising but rely on epsilon-greedy exploration, thus subjecting the user to a random choice of action during learning. Alternative approaches such as Gaussian Process SARSA (GPSARSA) estimate uncertainties and are sample efficient, leading to better user experience, but on the expense of a greater computational complexity. This paper examines approaches to extract uncertainty estimates from deep Q-networks (DQN) in the context of dialogue management. We perform an extensive benchmark of deep Bayesian methods to extract uncertainty estimates, namely Bayes-By-Backprop, dropout, its concrete variation, bootstrapped ensemble and alpha-divergences, combining it with DQN algorithm.


Techniques for proving Asynchronous Convergence results for Markov Chain Monte Carlo methods

arXiv.org Machine Learning

Markov Chain Monte Carlo (MCMC) methods such as Gibbs sampling are finding widespread use in applied statistics and machine learning. These often lead to difficult computational problems, which are increasingly being solved on parallel and distributed systems such as compute clusters. Recent work has proposed running iterative algorithms such as gradient descent and MCMC in parallel asynchronously for increased performance, with good empirical results in certain problems. Unfortunately, for MCMC this parallelization technique requires new convergence theory, as it has been explicitly demonstrated to lead to divergence on some examples. Recent theory on Asynchronous Gibbs sampling describes why these algorithms can fail, and provides a way to alter them to make them converge. In this article, we describe how to apply this theory in a generic setting, to understand the asynchronous behavior of any MCMC algorithm, including those implemented using parameter servers, and those not based on Gibbs sampling.


An Improved Naive Bayes Classifier-based Noise Detection Technique for Classifying User Phone Call Behavior

arXiv.org Machine Learning

The presence of noisy instances in mobile phone data is a fundamental issue for classifying user phone call behavior (i.e., accept, reject, missed and outgoing), with many potential negative consequences. The classification accuracy may decrease and the complexity of the classifiers may increase due to the number of redundant training samples. To detect such noisy instances from a training dataset, researchers use naive Bayes classifier (NBC) as it identifies misclassified instances by taking into account independence assumption and conditional probabilities of the attributes. However, some of these misclassified instances might indicate usages behavioral patterns of individual mobile phone users. Existing naive Bayes classifier based noise detection techniques have not considered this issue and, thus, are lacking in classification accuracy. In this paper, we propose an improved noise detection technique based on naive Bayes classifier for effectively classifying users' phone call behaviors. In order to improve the classification accuracy, we effectively identify noisy instances from the training dataset by analyzing the behavioral patterns of individuals. We dynamically determine a noise threshold according to individual's unique behavioral patterns by using both the naive Bayes classifier and Laplace estimator. We use this noise threshold to identify noisy instances. To measure the effectiveness of our technique in classifying user phone call behavior, we employ the most popular classification algorithm (e.g., decision tree). Experimental results on the real phone call log dataset show that our proposed technique more accurately identifies the noisy instances from the training datasets that leads to better classification accuracy.


A Semantic Loss Function for Deep Learning with Symbolic Knowledge

arXiv.org Machine Learning

This paper develops a novel methodology for using symbolic knowledge in deep learning. From first principles, we derive a semantic loss function that bridges between neural output vectors and logical constraints. This loss function captures how close the neural network is to satisfying the constraints on its output. An experimental evaluation shows that our semantic loss function effectively guides the learner to achieve (near-)state-of-the-art results on semi-supervised multi-class classification. Moreover, it significantly increases the ability of the neural network to predict structured objects, such as rankings and paths. These discrete concepts are tremendously difficult to learn, and benefit from a tight integration of deep learning and symbolic reasoning methods.


Gaussian Process Neurons Learn Stochastic Activation Functions

arXiv.org Machine Learning

We propose stochastic, non-parametric activation functions that are fully learnable and individual to each neuron. Complexity and the risk of overfitting are controlled by placing a Gaussian process prior over these functions. The result is the Gaussian process neuron, a probabilistic unit that can be used as the basic building block for probabilistic graphical models that resemble the structure of neural networks. The proposed model can intrinsically handle uncertainties in its inputs and self-estimate the confidence of its predictions. Using variational Bayesian inference and the central limit theorem, a fully deterministic loss function is derived, allowing it to be trained as efficiently as a conventional neural network using mini-batch gradient descent. The posterior distribution of activation functions is inferred from the training data alongside the weights of the network. The proposed model favorably compares to deep Gaussian processes, both in model complexity and efficiency of inference. It can be directly applied to recurrent or convolutional network structures, allowing its use in audio and image processing tasks. As an preliminary empirical evaluation we present experiments on regression and classification tasks, in which our model achieves performance comparable to or better than a Dropout regularized neural network with a fixed activation function. Experiments are ongoing and results will be added as they become available.


On the use of bootstrap with variational inference: Theory, interpretation, and a two-sample test example

arXiv.org Machine Learning

Variational inference is a general approach for approximating complex density functions, such as those arising in latent variable models, popular in machine learning. It has been applied to approximate the maximum likelihood estimator and to carry out Bayesian inference, however, quantification of uncertainty with variational inference remains challenging from both theoretical and practical perspectives. This paper is concerned with developing uncertainty measures for variational inference by using bootstrap procedures. We first develop two general bootstrap approaches for assessing the uncertainty of a variational estimate and the study the underlying bootstrap theory in both fixed- and increasing-dimension settings. We then use the bootstrap approach and our theoretical results in the context of mixed membership modeling with multivariate binary data on functional disability from the National Long Term Care Survey. We carry out a two-sample approach to test for changes in the repeated measures of functional disability for the subset of individuals present in 1984 and 1994 waves.


Particle Optimization in Stochastic Gradient MCMC

arXiv.org Machine Learning

Stochastic gradient Markov chain Monte Carlo (SG-MCMC) has been increasingly popular in Bayesian learning due to its ability to deal with large data. A standard SG-MCMC algorithm simulates samples from a discretized-time Markov chain to approximate a target distribution. However, the samples are typically highly correlated due to the sequential generation process, an undesired property in SG-MCMC. In contrary, Stein variational gradient descent (SVGD) directly optimizes a set of particles, and it is able to approximate a target distribution with much fewer samples. In this paper, we propose a novel method to directly optimize particles (or samples) in SG-MCMC from scratch. Specifically, we propose efficient methods to solve the corresponding Fokker-Planck equation on the space of probability distributions, whose solution (i.e., a distribution) is approximated by particles. Through our framework, we are able to show connections of SG-MCMC to SVGD, as well as the seemly unrelated generative-adversarial-net framework. Under certain relaxations, particle optimization in SG-MCMC can be interpreted as an extension of standard SVGD with momentum.


Efficient exploration with Double Uncertain Value Networks

arXiv.org Machine Learning

This paper studies directed exploration for reinforcement learning agents by tracking uncertainty about the value of each available action. We identify two sources of uncertainty that are relevant for exploration. The first originates from limited data (parametric uncertainty), while the second originates from the distribution of the returns (return uncertainty). We identify methods to learn these distributions with deep neural networks, where we estimate parametric uncertainty with Bayesian drop-out, while return uncertainty is propagated through the Bellman equation as a Gaussian distribution. Then, we identify that both can be jointly estimated in one network, which we call the Double Uncertain Value Network. The policy is directly derived from the learned distributions based on Thompson sampling. Experimental results show that both types of uncertainty may vastly improve learning in domains with a strong exploration challenge.


Snorkel: Rapid Training Data Creation with Weak Supervision

arXiv.org Machine Learning

Labeling training data is increasingly the largest bottleneck in deploying machine learning systems. We present Snorkel, a first-of-its-kind system that enables users to train state-of-the-art models without hand labeling any training data. Instead, users write labeling functions that express arbitrary heuristics, which can have unknown accuracies and correlations. Snorkel denoises their outputs without access to ground truth by incorporating the first end-to-end implementation of our recently proposed machine learning paradigm, data programming. We present a flexible interface layer for writing labeling functions based on our experience over the past year collaborating with companies, agencies, and research labs. In a user study, subject matter experts build models 2.8x faster and increase predictive performance an average 45.5% versus seven hours of hand labeling. We study the modeling tradeoffs in this new setting and propose an optimizer for automating tradeoff decisions that gives up to 1.8x speedup per pipeline execution. In two collaborations, with the U.S. Department of Veterans Affairs and the U.S. Food and Drug Administration, and on four open-source text and image data sets representative of other deployments, Snorkel provides 132% average improvements to predictive performance over prior heuristic approaches and comes within an average 3.60% of the predictive performance of large hand-curated training sets.