Methods for learning Bayesian network structure can discover dependency structure between observed variables, and have been shown to be useful in many applications. However, in domains that involve a large number of variables, the space of possible network structures is enormous, making it difficult, for both computational and statistical reasons, to identify a good model. In this paper, we consider a solution to this problem, suitable for domains where many variables have similar behavior. Our method is based on a new class of models, which we call module networks. A module network explicitly represents the notion of a module - a set of variables that have the same parents in the network and share the same conditional probability distribution. We define the semantics of module networks, and describe an algorithm that learns a module network from data. The algorithm learns both the partitioning of the variables into modules and the dependency structure between the variables. We evaluate our algorithm on synthetic data, and on real data in the domains of gene expression and the stock market. Our results show that module networks generalize better than Bayesian networks, and that the learned module network structure reveals regularities that are obscured in learned Bayesian networks.
This paper introduces an information theoretic approach to verification of modular causal probabilistic models. We assume systems which are gradually extended by adding new functional modules, each having a limited domain knowledge captured by a local Bayesian network. Different modules originate from independent design processes. We assume that the local models are correct, which, however does not guarantee globally coherent inference in composed systems. The introduced method supports discovery of significant inter module dependencies which are ignored in the assembled Bayesian network.
We present an efficient, principled, and interpretable technique for inferring module assignments and for identifying the optimal number of modules in a given network. We show how several existing methods for finding modules can be described as variant, special, or limiting cases of our work, and how the method overcomes the resolution limit problem, accurately recovering the true number of modules. Our approach is based on Bayesian methods for model selection which have been used with success for almost a century, implemented using a variational technique developed only in the past decade. We apply the technique to synthetic and real networks and outline how the method naturally allows selection among competing models.
The linking genotype to phenotype is the fundamental aim of modern genetics. We focus on study of links between gene expression data and phenotype data through integrative analysis. We propose three approaches. 1) The inherent complexity of phenotypes makes high-throughput phenotype profiling a very difficult and laborious process. We propose a method of automated multi-dimensional profiling which uses gene expression similarity. Large-scale analysis show that our method can provide robust profiling that reveals different phenotypic aspects of samples. This profiling technique is also capable of interpolation and extrapolation beyond the phenotype information given in training data. It can be used in many applications, including facilitating experimental design and detecting confounding factors. 2) Phenotype association analysis problems are complicated by small sample size and high dimensionality. Consequently, phenotype-associated gene subsets obtained from training data are very sensitive to selection of training samples, and the constructed sample phenotype classifiers tend to have poor generalization properties. To eliminate these obstacles, we propose a novel approach that generates sequences of increasingly discriminative gene cluster combinations. Our experiments on both simulated and real datasets show robust and accurate classification performance. 3) Many complex phenotypes, such as cancer, are the product of not only gene expression, but also gene interaction. We propose an integrative approach to find gene network modules that activate under different phenotype conditions. Using our method, we discovered cancer subtype-specific network modules, as well as the ways in which these modules coordinate. In particular, we detected a breast-cancer specific tumor suppressor network module with a hub gene, PDGFRL, which may play an important role in this module.
Microbes are everywhere, including in and on our bodies, and have been shown to play key roles in a variety of prevalent human diseases. Consequently, there has been intense interest in the design of bacteriotherapies or "bugs as drugs," which are communities of bacteria administered to patients for specific therapeutic applications. Central to the design of such therapeutics is an understanding of the causal microbial interaction network and the population dynamics of the organisms. In this work we present a Bayesian nonparametric model and associated efficient inference algorithm that addresses the key conceptual and practical challenges of learning microbial dynamics from time series microbe abundance data. These challenges include high-dimensional (300+ strains of bacteria in the gut) but temporally sparse and non-uniformly sampled data; high measurement noise; and, nonlinear and physically non-negative dynamics. Our contributions include a new type of dynamical systems model for microbial dynamics based on what we term interaction modules, or learned clusters of latent variables with redundant interaction structure (reducing the expected number of interaction coefficients from $O(n^2)$ to $O((\log n)^2)$); a fully Bayesian formulation of the stochastic dynamical systems model that propagates measurement and latent state uncertainty throughout the model; and introduction of a temporally varying auxiliary variable technique to enable efficient inference by relaxing the hard non-negativity constraint on states. We apply our method to simulated and real data, and demonstrate the utility of our technique for system identification from limited data and gaining new biological insights into bacteriotherapy design.