Several diseases related to cell proliferation are characterized by the accumulation of somatic DNA changes, with respect to wildtype conditions. Cancer and HIV are two common examples of such diseases, where the mutational load in the cancerous/viral population increases over time. In these cases, selective pressures are often observed along with competition, cooperation and parasitism among distinct cellular clones. Recently, we presented a mathematical framework to model these phenomena, based on a combination of Bayesian inference and Suppes' theory of probabilistic causation, depicted in graphical structures dubbed Suppes-Bayes Causal Networks (SBCNs). SBCNs are generative probabilistic graphical models that recapitulate the potential ordering of accumulation of such DNA changes during the progression of the disease. Such models can be inferred from data by exploiting likelihood-based model-selection strategies with regularization. In this paper we discuss the theoretical foundations of our approach and we investigate in depth the influence on the model-selection task of: (i) the poset based on Suppes' theory and (ii) different regularization strategies. Furthermore, we provide an example of application of our framework to HIV genetic data highlighting the valuable insights provided by the inferred.
One of the critical issues when adopting Bayesian networks (BNs) to model dependencies among random variables is to "learn" their structure. This is a well-known NP-hard problem in its most general and classical formulation, which is furthermore complicated by known pitfalls such as the issue of I-equivalence among different structures. In this work we restrict the investigation to a specific class of networks, i.e., those representing the dynamics of phenomena characterized by the monotonic accumulation of events. Such phenomena allow to set specific structural constraints based on Suppes' theory of probabilistic causation and, accordingly, to define constrained BNs, named Suppes-Bayes Causal Networks (SBCNs). Within this framework, we study the structure learning of SBCNs via extensive simulations with various state-of-the-art search strategies, such as canonical local search techniques and Genetic Algorithms. This investigation is intended to be an extension and an in-depth clarification of our previous works on SBCN structure learning. Among the main results, we show that Suppes' constraints do simplify the learning task, by reducing the solution search space and providing a temporal ordering on the variables, which simplifies the complications derived by I-equivalent structures. Finally, we report on tradeoffs among different optimization techniques that can be used to learn SBCNs.
The emergence and development of cancer is a consequence of the accumulation over time of genomic mutations involving a specific set of genes, which provides the cancer clones with a functional selective advantage. In this work, we model the order of accumulation of such mutations during the progression, which eventually leads to the disease, by means of probabilistic graphic models, i.e., Bayesian Networks (BNs). We investigate how to perform the task of learning the structure of such BNs, according to experimental evidence, adopting a global optimization meta-heuristics. In particular, in this work we rely on Genetic Algorithms, and to strongly reduce the execution time of the inference -- which can also involve multiple repetitions to collect statistically significant assessments of the data -- we distribute the calculations using both multi-threading and a multi-node architecture. The results show that our approach is characterized by good accuracy and specificity; we also demonstrate its feasibility, thanks to a 84x reduction of the overall execution time with respect to a traditional sequential implementation.
Many applications of causal inference, such as finding the relationship between stock prices and news reports, involve both discrete and continuous variables observed over time. Inference with these complex sets of temporal data, though, has remained difficult and required a number of simplifications. We show that recent approaches for inferring temporal relationships (represented as logical formulas) can be adapted for inference with continuous valued effects. Building on advances in logic, PCTLc (an extension of PCTL with numerical constraints) is introduced here to allow representation and inference of relationships with a mixture of discrete and continuous components. Then, finding significant relationships in the continuous case can be done using the conditional expectation of an effect, rather than its conditional probability. We evaluate this approach on both synthetically generated and actual financial market data, demonstrating that it can allow us to answer different questions than the discrete approach can.
Discrimination discovery from data is an important task aiming at identifying patterns of illegal and unethical discriminatory activities against protected-by-law groups, e.g., ethnic minorities. While any legally-valid proof of discrimination requires evidence of causality, the state-of-the-art methods are essentially correlation-based, albeit, as it is well known, correlation does not imply causation. In this paper we take a principled causal approach to the data mining problem of discrimination detection in databases. Following Suppes' probabilistic causation theory, we define a method to extract, from a dataset of historical decision records, the causal structures existing among the attributes in the data. The result is a type of constrained Bayesian network, which we dub Suppes-Bayes Causal Network (SBCN). Next, we develop a toolkit of methods based on random walks on top of the SBCN, addressing different anti-discrimination legal concepts, such as direct and indirect discrimination, group and individual discrimination, genuine requirement, and favoritism. Our experiments on real-world datasets confirm the inferential power of our approach in all these different tasks.