Goto

Collaborating Authors

 Uncertainty


Learnable Explicit Density for Continuous Latent Space and Variational Inference

arXiv.org Machine Learning

In this paper, we study two aspects of the variational autoencoder (VAE): the prior distribution over the latent variables and its corresponding posterior. First, we decompose the learning of VAEs into layerwise density estimation, and argue that having a flexible prior is beneficial to both sample generation and inference. Second, we analyze the family of inverse autoregressive flows (inverse AF) and show that with further improvement, inverse AF could be used as universal approximation to any complicated posterior. Our analysis results in a unified approach to parameterizing a VAE, without the need to restrict ourselves to use factorial Gaussians in the latent real space.


Learning Graphical Models from a Distributed Stream

arXiv.org Machine Learning

A current challenge for data management systems is to support the construction and maintenance of machine learning models over data that is large, multi-dimensional, and evolving. While systems that could support these tasks are emerging, the need to scale to distributed, streaming data requires new models and algorithms. In this setting, as well as computational scalability and model accuracy, we also need to minimize the amount of communication between distributed processors, which is the chief component of latency. We study Bayesian networks, the workhorse of graphical models, and present a communication-efficient method for continuously learning and maintaining a Bayesian network model over data that is arriving as a distributed stream partitioned across multiple processors. We show a strategy for maintaining model parameters that leads to an exponential reduction in communication when compared with baseline approaches to maintain the exact MLE (maximum likelihood estimation). Meanwhile, our strategy provides similar prediction errors for the target distribution and for classification tasks.


Constructing multi-modality and multi-classifier radiomics predictive models through reliable classifier fusion

arXiv.org Machine Learning

Radiomics aims to extract and analyze large numbers of quantitative features from medical images and is highly promising in staging, diagnosing, and predicting outcomes of cancer treatments. Nevertheless, several challenges need to be addressed to construct an optimal radiomics predictive model. First, the predictive performance of the model may be reduced when features extracted from an individual imaging modality are blindly combined into a single predictive model. Second, because many different types of classifiers are available to construct a predictive model, selecting an optimal classifier for a particular application is still challenging. In this work, we developed multi-modality and multi-classifier radiomics predictive models that address the aforementioned issues in currently available models. Specifically, a new reliable classifier fusion strategy was proposed to optimally combine output from different modalities and classifiers. In this strategy, modality-specific classifiers were first trained, and an analytic evidential reasoning (ER) rule was developed to fuse the output score from each modality to construct an optimal predictive model. One public data set and two clinical case studies were performed to validate model performance. The experimental results indicated that the proposed ER rule based radiomics models outperformed the traditional models that rely on a single classifier or simply use combined features from different modalities.


A Unifying Framework for Gaussian Process Pseudo-Point Approximations using Power Expectation Propagation

arXiv.org Machine Learning

Gaussian processes (GPs) are flexible distributions over functions that enable high-level assumptions about unknown functions to be encoded in a parsimonious, flexible and general way. Although elegant, the application of GPs is limited by computational and analytical intractabilities that arise when data are sufficiently numerous or when employing non-Gaussian models. Consequently, a wealth of GP approximation schemes have been developed over the last 15 years to address these key limitations. Many of these schemes employ a small set of pseudo data points to summarise the actual data. In this paper, we develop a new pseudo-point approximation framework using Power Expectation Propagation (Power EP) that unifies a large number of these pseudo-point approximations. Unlike much of the previous venerable work in this area, the new framework is built on standard methods for approximate inference (variational free-energy, EP and Power EP methods) rather than employing approximations to the probabilistic generative model itself. In this way, all of approximation is performed at `inference time' rather than at `modelling time' resolving awkward philosophical and empirical questions that trouble previous approaches. Crucially, we demonstrate that the new framework includes new pseudo-point approximation methods that outperform current approaches on regression and classification tasks.


Bayesian Learning for Statistical Classification โ€“ Stats and Bots

@machinelearnbot

A well-calibrated estimator for the conditional probabilities should obey this equation. Once we have derived a statistical classifier, we need to validate it on some test data. This data should be different from that used to train the classifier, otherwise skill scores will be unduly optimistic. This is known as cross-validation. The confusion matrix expresses everything about the accuracy of a discrete classifier over a given database and you can use it to compose any possible skill score. Here, we are going to cover two that are rarely seen in the literature, but are nonetheless important for reasons that will become clear.


Learning Functional Causal Models with Generative Neural Networks

arXiv.org Machine Learning

We introduce a new approach to functional causal modeling from observational data. The approach, called Causal Generative Neural Networks (CGNN), leverages the power of neural networks to learn a generative model of the joint distribution of the observed variables, by minimizing the Maximum Mean Discrepancy between generated and observed data. An approximate learning criterion is proposed to scale the computational cost of the approach to linear complexity in the number of observations. The performance of CGNN is studied throughout three experiments. First, we apply CGNN to the problem of cause-effect inference, where two CGNNs model $P(Y|X,\textrm{noise})$ and $P(X|Y,\textrm{noise})$ identify the best causal hypothesis out of $X\rightarrow Y$ and $Y\rightarrow X$. Second, CGNN is applied to the problem of identifying v-structures and conditional independences. Third, we apply CGNN to problem of multivariate functional causal modeling: given a skeleton describing the dependences in a set of random variables $\{X_1, \ldots, X_d\}$, CGNN orients the edges in the skeleton to uncover the directed acyclic causal graph describing the causal structure of the random variables. On all three tasks, CGNN is extensively assessed on both artificial and real-world data, comparing favorably to the state-of-the-art. Finally, we extend CGNN to handle the case of confounders, where latent variables are involved in the overall causal model.


A Mutually-Dependent Hadamard Kernel for Modelling Latent Variable Couplings

arXiv.org Machine Learning

We introduce a novel kernel that models input-dependent couplings across multiple latent processes. The pairwise joint kernel measures covariance along inputs and across different latent signals in a mutually-dependent fashion. A latent correlation Gaussian process (LCGP) model combines these non-stationary latent components into multiple outputs by an input-dependent mixing matrix. Probit classification and support for multiple observation sets are derived by Variational Bayesian inference. Results on several datasets indicate that the LCGP model can recover the correlations between latent signals while simultaneously achieving state-of-the-art performance. We highlight the latent covariances with an EEG classification dataset where latent brain processes and their couplings simultaneously emerge from the model.


Steps Toward Robust Artificial Intelligence

AI Magazine

Recent advances in artificial intelligence are encouraging governments and corporations to deploy AI in high-stakes settings including driving cars autonomously, managing the power grid, trading on stock exchanges, and controlling autonomous weapons systems. Such applications require AI methods to be robust to both the known unknowns (those uncertain aspects of the world about which the computer can reason explicitly) and the unknown unknowns (those aspects of the world that are not captured by the systemโ€™s models). This article discusses recent progress in AI and then describes eight ideas related to robustness that are being pursued within the AI research community. While these ideas are a start, we need to devote more attention to the challenges of dealing with the known and unknown unknowns. These issues are fascinating, because they touch on the fundamental question of how finite systems can survive and thrive in a complex and dangerous world


Estimating the Fundamental Limits is Easier than Achieving the Fundamental Limits

arXiv.org Machine Learning

Suppose there exist three machine learning experts that would like to understand the fundamental limits of classification (Bayes error) [1] for a specific dataset. Since the true distribution that generates the data is unknown, they take three different approaches: 1) Expert A: given empirical training samples, produce an estimate of the Bayes error that is (near) optimal statistically; 2) Expert B: construct a (near) optimal classifier based on the training sample, and then use its performance on the test set (may have infinite size) to estimate the Bayes error; 3) Expert C: use the training error of a (near) optimal classification algorithm to estimate the Bayes error. We ask the question: are there any fundamental differences between experts A, B, and C? Evidently, expert A is not constrained by any specific approaches as experts B and C are, but if B and C are using (near) optimal classification algorithms, would B or C achieve the same performance of A if A chooses to act optimally? Similar situations arise in the understanding of fundamental limits of data compression and sequential prediction under logarithmic loss, which is given by the Shannon entropy rate [2]. In this situation, there could exist four different experts: 1) A: would like to estimate the limits of compression (near) optimally; 2) B: would like to construct a predictor based on training samples and use its prediction accuracy under logarithmic loss on the test set (may have infinite size) to estimate the limits; 3) C: would like to use the training error of a (near) optimal sequential predictor to estimate the limits; 4) D: would like to construct a (near) optimal data compressor and use its normalized code length to estimate the limits. In this situation, are there any fundamental differences between the tasks of these four experts?


Learning Directed Acyclic Graphs with Penalized Neighbourhood Regression

arXiv.org Machine Learning

We study a family of regularized score-based estimators for learning the structure of a directed acyclic graph (DAG) for a multivariate normal distribution from high-dimensional data with $p\gg n$. Our main results establish support recovery guarantees and deviation bounds for a family of penalized least-squares estimators under concave regularization without assuming prior knowledge of a variable ordering. These results apply to a variety of practical situations that allow for arbitrary nondegenerate covariance structures as well as many popular regularizers including the MCP, SCAD, $\ell_{0}$ and $\ell_{1}$. The proof relies on interpreting a DAG as a recursive linear structural equation model, which reduces the estimation problem to a series of neighbourhood regressions. We provide a novel statistical analysis of these neighbourhood problems, establishing uniform control over the superexponential family of neighbourhoods associated with a Gaussian distribution. We then apply these results to study the statistical properties of score-based DAG estimators, learning causal DAGs, and inferring conditional independence relations via graphical models. Our results yield---for the first time---finite-sample guarantees for structure learning of Gaussian DAGs in high-dimensions via score-based estimation.