Graphical modelling has a long history in statistics as a tool for the analysis of multivariate data, starting from Wright's path analysis and Gibbs' applications to statistical physics at the beginning of the last century. In its modern form, it was pioneered by Lauritzen and Wermuth and by Pearl in the 1980s, and has since found applications in fields as diverse as bioinformatics, customer satisfaction surveys and weather forecasting. Genetics and systems biology are unique among these fields in the dimensionality of the data sets they study, which often contain several hundred variables but only a few tens or hundreds of observations. This raises problems of both computational complexity and statistical significance of the resulting networks, collectively known as the "curse of dimensionality". Furthermore, the data themselves are difficult to model correctly because the underlying mechanisms are only partly understood. In the following, we illustrate how these challenges affect graphical modelling in practice and present some possible solutions.
Probabilistic graphical models (PGMs) have become a popular tool for the computational analysis of biological data in a variety of domains. But what exactly are they, and how do they work? How can we use PGMs to discover patterns that are biologically relevant? And to what extent can PGMs help us formulate new hypotheses that are testable at the bench? This note sketches out some answers and illustrates the main ideas behind the statistical approach to biological pattern discovery.
In previous discussions of Bayesian inference we introduced Bayesian statistics and considered how to infer a binomial proportion using conjugate priors. We noted that not all models can make use of conjugate priors, in which case the posterior distribution has to be approximated numerically. In this article we introduce the main family of algorithms, known collectively as Markov Chain Monte Carlo (MCMC), that allow us to approximate the posterior distribution defined by Bayes' theorem. In particular, we consider the Metropolis algorithm, which is easily stated and relatively straightforward to understand. It serves as a useful starting point when learning about MCMC before delving into more sophisticated algorithms such as Metropolis-Hastings, Gibbs sampling and Hamiltonian Monte Carlo. Once we have described how MCMC works, we will carry it out using the open-source PyMC3 library, which takes care of many of the underlying implementation details, allowing us to concentrate on Bayesian modelling.
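To make the idea concrete before turning to a library, the following is a minimal sketch of a random-walk Metropolis sampler for a binomial proportion, written in plain NumPy rather than PyMC3. The function names, the Beta(1, 1) prior, the proposal step size and the data (10 successes in 50 trials) are all illustrative assumptions, not values taken from the article.

```python
import numpy as np


def log_posterior(theta, successes, trials, alpha=1.0, beta=1.0):
    """Unnormalised log posterior of a binomial proportion.

    Assumes a Beta(alpha, beta) prior; the defaults give a flat prior.
    """
    if theta <= 0.0 or theta >= 1.0:
        return -np.inf  # zero posterior density outside (0, 1)
    log_prior = (alpha - 1.0) * np.log(theta) + (beta - 1.0) * np.log(1.0 - theta)
    log_lik = successes * np.log(theta) + (trials - successes) * np.log(1.0 - theta)
    return log_prior + log_lik


def metropolis(successes, trials, n_samples=20000, step=0.1, start=0.5, seed=0):
    """Random-walk Metropolis with a symmetric Gaussian proposal."""
    rng = np.random.default_rng(seed)
    samples = np.empty(n_samples)
    current = start
    current_lp = log_posterior(current, successes, trials)
    accepted = 0
    for i in range(n_samples):
        # Propose a move from a distribution symmetric around the current state.
        proposal = current + rng.normal(0.0, step)
        proposal_lp = log_posterior(proposal, successes, trials)
        # Accept with probability min(1, posterior ratio), computed in log space.
        if np.log(rng.uniform()) < proposal_lp - current_lp:
            current, current_lp = proposal, proposal_lp
            accepted += 1
        samples[i] = current  # a rejected proposal repeats the current state
    return samples, accepted / n_samples


# Hypothetical data: 10 successes in 50 trials.
samples, acceptance_rate = metropolis(successes=10, trials=50)
burned = samples[2000:]  # discard an initial burn-in period
print("posterior mean estimate:", burned.mean())
print("acceptance rate:", acceptance_rate)
```

Because the Beta prior is conjugate here, the estimate can be checked against the exact posterior, Beta(11, 41), whose mean is 11/52 (about 0.21). That check is the reason for choosing a conjugate example; the sampler itself never uses conjugacy and only needs the posterior up to a normalising constant.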
The aim of this chapter is twofold. In the first part we provide a brief overview of the mathematical and statistical foundations of graphical models, along with their fundamental properties and basic estimation and inference procedures. In particular, we develop Markov networks (also known as Markov random fields) and Bayesian networks, which account for most of the past and current literature on graphical models. In the second part we review some applications of graphical models in systems biology.
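As a brief reminder of what distinguishes the two model classes, their standard factorisations can be written as follows. The notation is assumed here for illustration and is not taken from the chapter: $X_1, \dots, X_p$ are the variables, $\mathrm{Pa}(X_i)$ denotes the parents of $X_i$ in a directed acyclic graph, $\mathcal{C}$ is the set of cliques of an undirected graph, $\psi_C$ are non-negative potential functions and $Z$ is the normalising constant.

```latex
% Bayesian network: one conditional distribution per node given its parents
P(X_1, \dots, X_p) = \prod_{i=1}^{p} P\bigl(X_i \mid \mathrm{Pa}(X_i)\bigr)

% Markov network: normalised product of clique potentials
P(X_1, \dots, X_p) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(X_C),
\qquad
Z = \sum_{x} \prod_{C \in \mathcal{C}} \psi_C(x_C)
```

In the directed case each factor is itself a probability distribution, so no normalising constant is needed. In the undirected case the potentials need not be probabilities, which is why $Z$ appears (with the sum replaced by an integral for continuous variables).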
A hybrid data exploration and modeling method that combines multi-way recursive partitioning with the probabilistic reasoning of Bayesian networks is presented. The method uses the feature extraction capabilities of recursive partitioning to explore the data and construct the network. This form of feature extraction can handle real, raw data sets, which typically have many more features (not all of them informative) than samples. The resulting network supports probabilistic reasoning under uncertainty and offers semantic and statistical justification for its conclusions, giving the user strong predictive ability and an understanding of the domain. The method accommodates continuous and discrete variables, missing data, and non-independent features. In addition, no assumptions are made regarding the underlying structure(s) within the data. Given its predictive ability, its data handling and information extraction capabilities, and its statistical and semantic justification, applications such as quantitative structure-activity relationship (QSAR) modeling, risk assessment, and toxicological evaluation could benefit from this method.