Pacchiardi, Lorenzo, Dutta, Ritabrata

To perform Bayesian inference for stochastic simulator models whose likelihood is not accessible, Likelihood-Free Inference (LFI) relies on simulations from the model. Standard LFI methods can be split according to how these simulations are used: to build an explicit surrogate likelihood, or to accept/reject parameter values according to a measure of distance from the observations (Approximate Bayesian Computation, ABC). In both cases, simulations are adaptively tailored to the value of the observation. Here, we generate parameter-simulation pairs from the model independently of the observation, and use them to learn a conditional exponential family likelihood approximation; to parametrize it, we use neural networks whose weights are tuned with score matching. With our likelihood approximation, we can employ MCMC for doubly intractable distributions to draw samples from the posterior for any number of observations without additional model simulations, with performance competitive with comparable approaches. Further, the sufficient statistics of the exponential family can be used as summaries in ABC, outperforming the state-of-the-art method on five different models with known likelihood. Finally, we apply our method to a challenging model from meteorology.
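The score-matching idea can be seen in minimal form on a one-dimensional Gaussian, whose unnormalized log-density is linear in the sufficient statistics t(x) = (x, x²); the objective then has a closed-form minimizer and the normalizing constant is never needed. This is a toy sketch of our own, not the paper's method: the paper parametrizes the exponential family with neural networks, which is omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(2.0, 1.5, 100_000)  # data from N(2, 1.5^2)

# Unnormalized model: log q(x) = eta1*x + eta2*x^2 (+ const), so the score is
# d/dx log q = eta1 + 2*eta2*x and d^2/dx^2 log q = 2*eta2.
# Score matching minimizes E[0.5*(eta1 + 2*eta2*x)^2 + 2*eta2],
# which has the closed-form solution below.
m1, m2 = x.mean(), (x**2).mean()
eta2 = -1.0 / (2.0 * (m2 - m1**2))
eta1 = -2.0 * eta2 * m1

# Recover mean and variance from the natural parameters.
var_hat = -1.0 / (2.0 * eta2)
mu_hat = eta1 * var_hat
print(mu_hat, var_hat)  # close to 2 and 2.25
```

Setting the two partial derivatives of the empirical objective to zero gives exactly these expressions, illustrating why no intractable normalizer appears anywhere in the fit.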

Dutta, Ritabrata, Zouaoui-Boudjeltia, Karim, Kotsalos, Christos, Rousseau, Alexandre, de Sousa, Daniel Ribeiro, Desmet, Jean-Marc, Van Meerhaeghe, Alain, Mira, Antonietta, Chopard, Bastien

Cardio/cerebrovascular diseases (CVD) were the leading cause of mortality worldwide in 2015, accounting for 31% of deaths according to the World Health Organization [Organization, 2015]. Blood platelets play a key role in the occurrence of these cardio/cerebrovascular accidents, in addition to the complex process of blood coagulation: they adhere, aggregate and spread on the vascular wall to stop a hemorrhage while avoiding vessel occlusion. Yet in a recent biomedical evaluation study by Breet et al. [2010], the correlation between clinical biological measures from platelet function tests and the occurrence of a cardiovascular event was found to be null for half of the techniques and rather modest for the others, indicating a clear need for a more efficient tool or method to monitor patients' platelet functionality. This may be due to the fact that no current test allows the analysis of the different stages of platelet activation or the prediction of the in-vivo behavior of those platelets [Picker, 2011, Koltai et al., 2017]. In addition, current clinical tests do not take into account the dynamic aspect of platelet aggregate formation or the role that red blood cells can play in this process. To address these issues, Chopard et al. [2017b] provided a physical description of the adhesion and aggregation of platelets in the Impact-R device by combining digital holography microscopy and mathematical modeling. They developed a numerical model that quantitatively describes how platelets in a shear flow adhere and aggregate on a deposition surface. Further, Dutta et al. [2018] showed how the five parameters of this model, which specify the deposition process and are relevant for the biomedical understanding of the phenomenon, can be inferred from a blood sample collected from an individual using approximate Bayesian computation (ABC) [Lintusaari et al., 2017]. Our main claim here is that the values of some of these parameters (e.g.

Dutta, Ritabrata, Mira, Antonietta, Onnela, Jukka-Pekka

Infectious diseases are studied to understand their spreading mechanisms, to evaluate control strategies and to predict the risk and course of future outbreaks. Because people only interact with a small number of individuals, and because the structure of these interactions matters for spreading processes, the pairwise relationships between individuals in a population can be usefully represented by a network. Although the underlying processes of transmission are different, the network approach can be used to study the spread of pathogens in a contact network or the spread of rumors in an online social network. We study simulated simple and complex epidemics on synthetic networks and on two empirical networks, a social/contact network in an Indian village and an online social network in the U.S. Our goal is to learn simultaneously about the spreading process parameters and the source node (the first infected node) of the epidemic, given a fixed and known network structure and observations of the state of the nodes at several points in time. Our inference scheme is based on approximate Bayesian computation (ABC), an inference technique for complex models with likelihood functions that are either expensive to evaluate or analytically intractable. ABC enables us to adopt a Bayesian approach to the problem despite the posterior distribution being very complex. Our method is agnostic about the topology of the network and the nature of the spreading process. It generally performs well and, somewhat counter-intuitively, the inference problem appears to be easier on more heterogeneous network topologies, which enhances its future applicability to real-world settings where few networks have homogeneous topologies.
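The inference scheme can be illustrated with a rejection-ABC toy version on a small ring network. All settings here (topology, SI dynamics, Hamming-distance summary, tolerance) are hypothetical stand-ins, not the paper's actual networks or summaries:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy ring network: node i is connected to i-1 and i+1.
N = 30
neighbors = {i: [(i - 1) % N, (i + 1) % N] for i in range(N)}

def simulate_si(beta, source, steps=10):
    """Discrete-time SI epidemic: each infected node infects each susceptible
    neighbour with probability beta per step; returns the infection pattern."""
    infected = np.zeros(N, dtype=bool)
    infected[source] = True
    for _ in range(steps):
        newly = [j for i in np.flatnonzero(infected) for j in neighbors[i]
                 if not infected[j] and rng.random() < beta]
        infected[newly] = True
    return infected

# "Observed" epidemic with (hypothetical) ground truth beta=0.4, source=5.
x_obs = simulate_si(0.4, 5)

# Rejection ABC: draw (beta, source) from the priors, accept draws whose
# simulated infection pattern is within Hamming distance 3 of the observed one.
accepted = []
for _ in range(5000):
    beta = rng.random()        # Uniform(0, 1) prior on the spreading rate
    source = rng.integers(N)   # uniform prior over the source node
    if np.sum(simulate_si(beta, source) != x_obs) <= 3:
        accepted.append((beta, source))
```

The accepted pairs approximate the joint posterior over the spreading rate and the source; tightening the tolerance sharpens the approximation at the cost of a lower acceptance rate.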

Dutta, Ritabrata, Corander, Jukka, Kaski, Samuel, Gutmann, Michael U.

We consider the problem of parametric statistical inference when likelihood computations are prohibitively expensive but sampling from the model is possible. Several so-called likelihood-free methods have been developed to perform inference in the absence of a likelihood function. The popular synthetic likelihood approach infers the parameters by modelling summary statistics of the data with a Gaussian probability distribution. In another popular approach, called approximate Bayesian computation, the inference is performed by identifying parameter values for which the summary statistics of the simulated data are close to those of the observed data. Synthetic likelihood is easier to use as no measure of "closeness" is required, but the Gaussianity assumption is often limiting. Moreover, both approaches require judiciously chosen summary statistics. Here we present an alternative inference approach that is as easy to use as synthetic likelihood but not as restricted in its assumptions, and that, in a natural way, enables automatic selection of relevant summary statistics from a large set of candidates. The basic idea is to frame the problem of estimating the posterior as a problem of estimating the ratio between the data generating distribution and the marginal distribution. This problem can be solved by logistic regression, and including regularising penalty terms enables automatic selection of the summary statistics relevant to the inference task. We illustrate the general theory on toy problems and use it to perform inference for stochastic nonlinear dynamical systems.
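A minimal sketch of the ratio idea on a toy Gaussian problem of our own (not the paper's experiments): with balanced classes, the log-odds fitted by logistic regression on candidate statistics (x, x²) estimates log p(x|θ) − log p(x), and multiplying the resulting ratio by the prior gives the posterior. The regularising penalty used for statistic selection is omitted here for brevity.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000
theta = 1.0

x1 = rng.normal(theta, 1.0, n)                  # data simulated at theta
x0 = rng.normal(rng.normal(0.0, 1.0, n), 1.0)   # data from marginal, theta ~ N(0, 1)
X = np.concatenate([x1, x0])
y = np.concatenate([np.ones(n), np.zeros(n)])
F = np.column_stack([np.ones(2 * n), X, X**2])  # candidate summary statistics

# Logistic regression by Newton's method (IRLS); with equal class sizes the
# fitted log-odds w @ f(x) estimates log p(x | theta) - log p(x).
w = np.zeros(3)
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-F @ w))
    g = F.T @ (y - p) / (2 * n)
    H = (F * (p * (1 - p))[:, None]).T @ F / (2 * n) + 1e-9 * np.eye(3)
    w += np.linalg.solve(H, g)

log_ratio_at_1 = w @ np.array([1.0, 1.0, 1.0])
# analytic value: log N(1; 1, 1) - log N(1; 0, 2) ≈ 0.597
```

Here the true log-ratio is exactly quadratic in x, so the three statistics suffice; with many candidate statistics, an L1 penalty on w would zero out the irrelevant ones.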

Gutmann, Michael U., Dutta, Ritabrata, Kaski, Samuel, Corander, Jukka

Increasingly complex generative models are being used across disciplines as they allow for realistic characterization of data, but a common difficulty with them is the prohibitively large computational cost to evaluate the likelihood function and thus to perform likelihood-based statistical inference. A likelihood-free inference framework has emerged where the parameters are identified by finding values that yield simulated data resembling the observed data. While widely applicable, a major difficulty in this framework is how to measure the discrepancy between the simulated and observed data. Transforming the original problem into a problem of classifying the data into simulated versus observed, we find that classification accuracy can be used to assess the discrepancy. The complete arsenal of classification methods thereby becomes available for inference of intractable generative models. We validate our approach using theory and simulations for both point estimation and Bayesian inference, and demonstrate its use on real data by inferring an individual-based epidemiological model for bacterial infections in child care centers.
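A minimal sketch of the idea, under a toy Gaussian setup of our own choosing: train any classifier to separate observed from simulated samples and use its held-out accuracy as the discrepancy. Accuracy near 0.5 means the classifier cannot tell the data sets apart, so the candidate parameter is plausible.

```python
import numpy as np

rng = np.random.default_rng(2)
x_obs = rng.normal(0.0, 1.0, 500)  # "observed" data, true mean 0

def classification_discrepancy(theta):
    """Held-out accuracy of a simple threshold classifier separating
    observed from simulated data; ~0.5 means indistinguishable."""
    x_sim = rng.normal(theta, 1.0, 500)
    tr_o, te_o = x_obs[:250], x_obs[250:]   # train/test split, observed
    tr_s, te_s = x_sim[:250], x_sim[250:]   # train/test split, simulated
    thr = 0.5 * (tr_o.mean() + tr_s.mean())  # midpoint decision boundary
    sign = 1.0 if tr_s.mean() >= tr_o.mean() else -1.0
    correct = np.sum(sign * (te_s - thr) > 0) + np.sum(sign * (te_o - thr) <= 0)
    return correct / 500
```

At the true parameter the held-out accuracy hovers around chance level, while at a wrong parameter the classes separate, so minimizing this discrepancy over θ yields a point estimate; any off-the-shelf classifier can replace the threshold rule.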

Dutta, Ritabrata, Blomstedt, Paul, Kaski, Samuel

Hierarchical models are versatile tools for the joint modeling of data sets arising from different, but related, sources. Fully Bayesian inference may, however, become computationally prohibitive if the source-specific data models are complex, or if the number of sources is very large. To facilitate computation, we propose an approach where inference is first made independently for the parameters of each data set, whereupon the obtained posterior samples are used as observed data in a substitute hierarchical model based on a scaled likelihood function. Compared to direct inference in a full hierarchical model, the approach has the advantage of speeding up convergence by breaking the initial large inference problem into smaller individual subproblems with better convergence properties. Moreover, it enables parallel processing of the possibly complex inferences of the source-specific parameters, which might otherwise create a computational bottleneck if processed jointly as part of a hierarchical model. The approach is illustrated with both simulated and real data.
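A toy two-stage sketch on a normal hierarchy can make the workflow concrete. Everything here is our own illustration, including the reading of the scaling, in which each source's sample log-likelihood is divided by the number of samples S so that a whole sample set counts as one observation; the paper's exact construction may differ.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic normal hierarchy (hypothetical setup): J sources, n obs each.
J, n, S = 8, 50, 400
mu_true, tau_true = 2.0, 1.0
theta = rng.normal(mu_true, tau_true, J)
data = [rng.normal(t, 1.0, n) for t in theta]

# Stage 1: independent flat-prior posteriors per source, kept as samples
# (conjugate here, so they can be sampled exactly, and in parallel).
post = [rng.normal(y.mean(), 1.0 / np.sqrt(n), S) for y in data]

# Stage 2: Metropolis over (mu, log tau) in the substitute hierarchical
# model, with flat priors assumed; each source's sample log-likelihood
# is scaled by 1/S.
def log_lik(mu, tau):
    return sum(np.mean(-0.5 * ((s - mu) / tau) ** 2 - np.log(tau)) for s in post)

mu, ltau = 0.0, 0.0
cur = log_lik(mu, np.exp(ltau))
chain = []
for _ in range(4000):
    mu_p = mu + 0.3 * rng.normal()
    ltau_p = ltau + 0.3 * rng.normal()
    prop = log_lik(mu_p, np.exp(ltau_p))
    if np.log(rng.random()) < prop - cur:
        mu, ltau, cur = mu_p, ltau_p, prop
    chain.append(mu)

mu_hat = np.mean(chain[1000:])  # posterior mean of the population mean
```

Stage 1 is embarrassingly parallel across sources, and Stage 2 only ever touches the stored posterior samples, never the raw source data.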

Blomstedt, Paul, Dutta, Ritabrata, Seth, Sohan, Brazma, Alvis, Kaski, Samuel

Motivation: Public and private repositories of experimental data are growing to sizes that require dedicated methods for finding relevant data. To improve on the state of the art of keyword searches from annotations, methods for content-based retrieval have been proposed. In the context of gene expression experiments, most methods retrieve gene expression profiles, requiring each experiment to be expressed as a single profile, typically of case vs. control. A more general, recently suggested alternative is to retrieve experiments whose models are good for modelling the query dataset. However, for very noisy and high-dimensional query data, this retrieval criterion turns out to be very noisy as well. Results: We propose doing retrieval using a denoised model of the query dataset, instead of the original noisy dataset itself. To this end, we introduce a general probabilistic framework, where each experiment is modelled separately and the retrieval is done by finding related models. For retrieval of gene expression experiments, we use a probabilistic model called the product partition model, which induces a clustering of genes that show similar expression patterns across a number of samples. The suggested metric for retrieval using clusterings is the normalized information distance. Finally, empirical results suggest that inference for the full probabilistic model can be approximated with good performance using computationally faster heuristic clustering approaches (e.g. k-means). The method is highly scalable and straightforward to apply to construct a general-purpose gene expression experiment retrieval method. Availability: The method can be implemented using standard clustering algorithms and normalized information distance, available in many statistical software packages.
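The retrieval distance can be computed directly from two cluster label vectors. A minimal sketch, using one common definition of the normalized information distance between clusterings (the paper's exact variant may differ):

```python
import numpy as np

def entropy(p):
    """Shannon entropy (nats) of a probability vector, ignoring zeros."""
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def normalized_information_distance(u, v):
    """NID between two clusterings given as label arrays:
    1 - I(U;V) / max(H(U), H(V)); 0 for identical partitions
    (up to relabelling), 1 for independent ones."""
    _, ui = np.unique(u, return_inverse=True)
    _, vi = np.unique(v, return_inverse=True)
    joint = np.zeros((ui.max() + 1, vi.max() + 1))
    np.add.at(joint, (ui, vi), 1.0)     # contingency table of label pairs
    joint /= joint.sum()
    hu = entropy(joint.sum(axis=1))
    hv = entropy(joint.sum(axis=0))
    huv = entropy(joint.ravel())
    denom = max(hu, hv)
    if denom == 0.0:                    # both clusterings are trivial
        return 0.0
    return 1.0 - (hu + hv - huv) / denom
```

For example, two partitions that agree up to relabelling get distance 0, while independent partitions get distance 1, so experiments can be ranked by increasing NID from the query's clustering.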

Gutmann, Michael U., Corander, Jukka, Dutta, Ritabrata, Kaski, Samuel

Some statistical models are specified via a data generating process for which the likelihood function cannot be computed in closed form. Standard likelihood-based inference is then not feasible, but the model parameters can be inferred by finding the values which yield simulated data that resemble the observed data. This approach faces at least two major difficulties: the first is the choice of the discrepancy measure which is used to judge whether the simulated data resemble the observed data; the second is the computationally efficient identification of regions in the parameter space where the discrepancy is low. Here we give an introduction to our recent work, in which we tackle these two difficulties through classification and Bayesian optimization.

Dutta, Ritabrata, Seth, Sohan, Kaski, Samuel

We address the problem of retrieving relevant experiments given a query experiment, motivated by the public databases of datasets in molecular biology and other experimental sciences, and by scientists' need to relate their work to earlier studies at the level of actual measurement data. Since experiments are inherently noisy and databases ever accumulating, we argue that a retrieval engine should possess two particular characteristics. First, it should compare models learnt from the experiments rather than the raw measurements themselves: this allows incorporating experiment-specific prior knowledge to suppress noise effects and to focus on what is important. Second, it should be updated sequentially from newly published experiments, without explicitly storing either the measurements or the models, which is critical for saving storage space and protecting data privacy: this promotes lifelong learning. We formulate the retrieval as a "supermodelling" problem of sequentially learning a model of the set of posterior distributions, represented as sets of MCMC samples, and suggest the use of a Particle-Learning-based sequential Dirichlet process mixture (DPM) for this purpose. The relevance measure for retrieval is derived from the supermodel through the mixture representation. We demonstrate the performance of the proposed retrieval method on simulated data and on molecular biological experiments.