The concept of causality has a controversial history. The question of whether it is possible to represent and address causal problems with probability theory, or if fundamentally new mathematics such as the do-calculus is required has been hotly debated, In this paper we demonstrate that, while it is critical to explicitly model our assumptions on the impact of intervening in a system, provided we do so, estimating causal effects can be done entirely within the standard Bayesian paradigm. The invariance assumptions underlying causal graphical models can be encoded in ordinary Probabilistic graphical models, allowing causal estimation with Bayesian statistics, equivalent to the do-calculus.
Graphical modelling has a long history in statistics as a tool for the analysis of multivariate data, starting from Wright's path analysis and Gibbs' applications to statistical physics at the beginning of the last century. In its modern form, it was pioneered by Lauritzen and Wermuth and Pearl in the 1980s, and has since found applications in fields as diverse as bioinformatics, customer satisfaction surveys and weather forecasts. Genetics and systems biology are unique among these fields in the dimension of the data sets they study, which often contain several hundreds of variables and only a few tens or hundreds of observations. This raises problems in both computational complexity and the statistical significance of the resulting networks, collectively known as the "curse of dimensionality". Furthermore, the data themselves are difficult to model correctly due to the limited understanding of the underlying mechanisms. In the following, we will illustrate how such challenges affect practical graphical modelling and some possible solutions.
Probabilistic graphical models (PGMs) have become a popular tool for computational analysis of biological data in a variety of domains. But, what exactly are they and how do they work? How can we use PGMs to discover patterns that are biologically relevant? And to what extent can PGMs help us formulate new hypotheses that are testable at the bench? This note sketches out some answers and illustrates the main ideas behind the statistical approach to biological pattern discovery.
In previous discussions of Bayesian Inference we introduced Bayesian Statistics and considered how to infer a binomial proportion using the concept of conjugate priors. We discussed the fact that not all models can make use of conjugate priors and thus calculation of the posterior distribution would need to be approximated numerically. In this article we introduce the main family of algorithms, known collectively as Markov Chain Monte Carlo (MCMC), that allow us to approximate the posterior distribution as calculated by Bayes' Theorem. In particular, we consider the Metropolis Algorithm, which is easily stated and relatively straightforward to understand. It serves as a useful starting point when learning about MCMC before delving into more sophisticated algorithms such as Metropolis-Hastings, Gibbs Samplers and Hamiltonian Monte Carlo. Once we have described how MCMC works, we will carry it out using the open-source PyMC3 library, which takes care of many of the underlying implementation details, allowing us to concentrate on Bayesian modelling.
The aim of this chapter is twofold. In the first part we will provide a brief overview of the mathematical and statistical foundations of graphical models, along with their fundamental properties, estimation and basic inference procedures. In particular we will develop Markov networks (also known as Markov random fields) and Bayesian networks, which comprise most past and current literature on graphical models. In the second part we will review some applications of graphical models in systems biology.