Collaborating Authors


Score Matched Conditional Exponential Families for Likelihood-Free Inference Machine Learning

To perform Bayesian inference for stochastic simulator models for which the likelihood is not accessible, Likelihood-Free Inference (LFI) relies on simulations from the model. Standard LFI methods can be split according to how these simulations are used: to build an explicit Surrogate Likelihood, or to accept/reject parameter values according to a measure of distance from the observations (Approximate Bayesian Computation (ABC)). In both cases, simulations are adaptively tailored to the value of the observation. Here, we generate parameter-simulation pairs from the model independently on the observation, and use them to learn a conditional exponential family likelihood approximation; to parametrize it, we use Neural Networks whose weights are tuned with Score Matching. With our likelihood approximation, we can employ MCMC for doubly intractable distributions to draw samples from the posterior for any number of observations without additional model simulations, with performance competitive to comparable approaches. Further, the sufficient statistics of the exponential family can be used as summaries in ABC, outperforming the state-of-the-art method in five different models with known likelihood. Finally, we apply our method to a challenging model from meteorology.

On Bayesian sparse canonical correlation analysis via Rayleigh quotient framework Machine Learning

Canonical correlation analysis is a statistical technique -dating back at least to [1] - that is used to maximally correlate multiple datasets for joint analysis. The technique has become a fundamental tool in biomedical research where technological advances have led to a huge number of multi-omic datasets ([2]; [3]; [4]). Over the past two decades, limited sample sizes, growing dimensionality, and the search for meaningful biological interpretations, have led to the development of sparse canonical correlation analysis ([2]), where a sparsity assumption is imposed on the canonical correlation vectors. This work falls under the topic of the Bayesian estimation of sparse canonical corrlation vectors. Model-based approaches to canonical correlation analysis were developed in the mid 2000's (see e.g., [5]), and paved the way for a Bayesian treatment of canonical correlation analysis ([6];[7]) and sparse canonical correlation analysis ([8]). However an serious shortcoming of such a Bayesian treatment is that this approach naturally requires a complete specification of the joint distribution of the data, so as to specify the likelihood function. This requirement is a serious limitation in many applications, where the data generating process is poorly understood, for example, image data.

Handling of uncertainty in medical data using machine learning and probability theory techniques: A review of 30 years (1991-2020) Artificial Intelligence

Understanding data and reaching valid conclusions are of paramount importance in the present era of big data. Machine learning and probability theory methods have widespread application for this purpose in different fields. One critically important yet less explored aspect is how data and model uncertainties are captured and analyzed. Proper quantification of uncertainty provides valuable information for optimal decision making. This paper reviewed related studies conducted in the last 30 years (from 1991 to 2020) in handling uncertainties in medical data using probability theory and machine learning techniques. Medical data is more prone to uncertainty due to the presence of noise in the data. So, it is very important to have clean medical data without any noise to get accurate diagnosis. The sources of noise in the medical data need to be known to address this issue. Based on the medical data obtained by the physician, diagnosis of disease, and treatment plan are prescribed. Hence, the uncertainty is growing in healthcare and there is limited knowledge to address these problems. We have little knowledge about the optimal treatment methods as there are many sources of uncertainty in medical science. Our findings indicate that there are few challenges to be addressed in handling the uncertainty in medical raw data and new models. In this work, we have summarized various methods employed to overcome this problem. Nowadays, application of novel deep learning techniques to deal such uncertainties have significantly increased.

Quantifying and Reducing Bias in Maximum Likelihood Estimation of Structured Anomalies Machine Learning

Anomaly estimation, or the problem of finding a subset of a dataset that differs from the rest of the dataset, is a classic problem in machine learning and data mining. In both theoretical work and in applications, the anomaly is assumed to have a specific structure defined by membership in an $\textit{anomaly family}$. For example, in temporal data the anomaly family may be time intervals, while in network data the anomaly family may be connected subgraphs. The most prominent approach for anomaly estimation is to compute the Maximum Likelihood Estimator (MLE) of the anomaly. However, it was recently observed that for some anomaly families, the MLE is an asymptotically $\textit{biased}$ estimator of the anomaly. Here, we demonstrate that the bias of the MLE depends on the size of the anomaly family. We prove that if the number of sets in the anomaly family that contain the anomaly is sub-exponential, then the MLE is asymptotically unbiased. At the same time, we provide empirical evidence that the converse is also true: if the number of such sets is exponential, then the MLE is asymptotically biased. Our analysis unifies a number of earlier results on the bias of the MLE for specific anomaly families, including intervals, submatrices, and connected subgraphs. Next, we derive a new anomaly estimator using a mixture model, and we empirically demonstrate that our estimator is asymptotically unbiased regardless of the size of the anomaly family. We illustrate the benefits of our estimator on both simulated disease outbreak data and a real-world highway traffic dataset.

Artificial Intelligence for Social Good: A Survey Artificial Intelligence

Its impact is drastic and real: Youtube's AIdriven recommendation system would present sports videos for days if one happens to watch a live baseball game on the platform [1]; email writing becomes much faster with machine learning (ML) based auto-completion [2]; many businesses have adopted natural language processing based chatbots as part of their customer services [3]. AI has also greatly advanced human capabilities in complex decision-making processes ranging from determining how to allocate security resources to protect airports [4] to games such as poker [5] and Go [6]. All such tangible and stunning progress suggests that an "AI summer" is happening. As some put it, "AI is the new electricity" [7]. Meanwhile, in the past decade, an emerging theme in the AI research community is the so-called "AI for social good" (AI4SG): researchers aim at developing AI methods and tools to address problems at the societal level and improve the wellbeing of the society.

ABCDP: Approximate Bayesian Computation Meets Differential Privacy Machine Learning

We develop a novel approximate Bayesian computation (ABC) framework, ABCDP, that obeys the notion of differential privacy (DP). Under our framework, simply performing ABC inference with a mild modification yields differentially private posterior samples. We theoretically analyze the interplay between the ABC similarity threshold $\epsilon_{abc}$ (for comparing the similarity between real and simulated data) and the resulting privacy level $\epsilon_{dp}$ of the posterior samples, in two types of frequently-used ABC algorithms. We apply ABCDP to simulated data as well as privacy-sensitive real data. The results suggest that tuning the similarity threshold $\epsilon_{abc}$ helps us obtain better privacy and accuracy trade-off.

Non-bifurcating phylogenetic tree inference via the adaptive LASSO Machine Learning

Phylogenetic tree inference using deep DNA sequencing is reshaping our understanding of rapidly evolving systems, such as the within-host battle between viruses and the immune system. Densely sampled phylogenetic trees can contain special features, including "sampled ancestors" in which we sequence a genotype along with its direct descendants, and "polytomies" in which multiple descendants arise simultaneously. These features are apparent after identifying zero-length branches in the tree. However, current maximum-likelihood based approaches are not capable of revealing such zero-length branches. In this paper, we find these zero-length branches by introducing adaptive-LASSO-type regularization estimators to phylogenetics, deriving their properties, and showing regularization to be a practically useful approach for phylogenetics.

Overview of Approximate Bayesian Computation Machine Learning

This Chapter, "Overview of Approximate Bayesian Computation", is to appear as the first chapter in the forthcoming Handbook of Approximate Bayesian Computation (2018). It details the main ideas and concepts behind ABC methods with many examples and illustrations.

Sequential Dirichlet Process Mixtures of Multivariate Skew t-distributions for Model-based Clustering of Flow Cytometry Data Machine Learning

Flow cytometry is a high-throughput technology used to quantify multiple surface and intracellular markers at the level of a single cell. This enables to identify cell sub-types, and to determine their relative proportions. Improvements of this technology allow to describe millions of individual cells from a blood sample using multiple markers. This results in high-dimensional datasets, whose manual analysis is highly time-consuming and poorly reproducible. While several methods have been developed to perform automatic recognition of cell populations, most of them treat and analyze each sample independently. However, in practice, individual samples are rarely independent (e.g. longitudinal studies). Here, we propose to use a Bayesian nonparametric approach with Dirichlet process mixture (DPM) of multivariate skew $t$-distributions to perform model based clustering of flow-cytometry data. DPM models directly estimate the number of cell populations from the data, avoiding model selection issues, and skew $t$-distributions provides robustness to outliers and non-elliptical shape of cell populations. To accommodate repeated measurements, we propose a sequential strategy relying on a parametric approximation of the posterior. We illustrate the good performance of our method on simulated data, on an experimental benchmark dataset, and on new longitudinal data from the DALIA-1 trial which evaluates a therapeutic vaccine against HIV. On the benchmark dataset, the sequential strategy outperforms all other methods evaluated, and similarly, leads to improved performance on the DALIA-1 data. We have made the method available for the community in the R package NPflow.