A typical problem in causal modeling is the instability of model structure learning, i.e., small changes in finite data can result in completely different optimal models. The present work introduces a novel causal modeling algorithm for longitudinal data that is robust for finite samples, building on recent advances in stability selection using subsampling and selection algorithms. Our approach uses exploratory search but allows incorporation of prior knowledge, e.g., the absence of a particular causal relationship between two specific variables. We represent causal relationships using structural equation models. Models are scored along two objectives: model fit and model complexity. Because these objectives often conflict, we apply a multi-objective evolutionary algorithm to search for Pareto-optimal models. To handle the instability of small finite data samples, we repeatedly subsample the data and select those substructures (from the optimal models) that are both stable and parsimonious. These substructures can be visualized through a causal graph. On a simulated data set with a known ground truth, our more exploratory approach performs at least comparably to, and often significantly better than, state-of-the-art alternative approaches. We also present the results of our method on three real-world longitudinal data sets on chronic fatigue syndrome, Alzheimer's disease, and chronic kidney disease. The findings obtained with our approach are generally in line with results from more hypothesis-driven analyses in earlier studies and suggest some novel relationships that deserve further research.
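The trade-off between model fit and model complexity in the abstract above can be illustrated with a minimal Pareto filter. This is only a sketch under stated assumptions: the function name `pareto_front`, the convention that both scores are minimized (a misfit value and a parameter count), and the candidate scores are all hypothetical. It does not reproduce the multi-objective evolutionary algorithm itself, only the dominance check that defines Pareto optimality.

```python
from typing import List, Tuple


def pareto_front(scores: List[Tuple[float, float]]) -> List[int]:
    """Return indices of non-dominated (misfit, complexity) pairs.

    Assumption: both objectives are minimized. A candidate is dominated
    if another candidate is no worse on both objectives and strictly
    better on at least one.
    """
    front = []
    for i, (fit_i, cx_i) in enumerate(scores):
        dominated = False
        for j, (fit_j, cx_j) in enumerate(scores):
            if j == i:
                continue
            no_worse = fit_j <= fit_i and cx_j <= cx_i
            strictly_better = fit_j < fit_i or cx_j < cx_i
            if no_worse and strictly_better:
                dominated = True
                break
        if not dominated:
            front.append(i)
    return front


# Hypothetical candidate models scored as (misfit, number of parameters).
candidates = [(10.0, 2), (7.5, 3), (7.5, 5), (4.0, 6), (9.0, 4)]
front = pareto_front(candidates)
print(front)
```

In this toy example, the candidates at indices 2 and 4 are each dominated by the candidate at index 1, which fits at least as well with fewer parameters, so they drop out of the front.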
Adequate problem representations require the identification of abstractions and approximations that are well suited to the task at hand. In this paper, we introduce a new class of approximations, called causal approximations, that are commonly found in modeling the physical world. Causal approximations support the efficient generation of parsimonious causal explanations, which play an important role in reasoning about engineered devices. The central problem to be solved in generating parsimonious causal explanations is the identification of a simplest model that explains the phenomenon of interest. We formalize this problem and show that it is, in general, intractable. In this formalization, simplicity of models is based on the intuition that using more approximate models of fewer phenomena leads to simpler models. We then show that when all the approximations are causal approximations, the above problem can be solved in polynomial time.
By taking into account the nonlinear effect of the cause, the inner noise effect, and the measurement distortion effect in the observed variables, the post-nonlinear (PNL) causal model has demonstrated excellent performance in distinguishing cause from effect. However, its identifiability has not been properly addressed, and how to apply it in the case of more than two variables is also a problem. In this paper, we conduct a systematic investigation of its identifiability in the two-variable case. We show that this model is identifiable in most cases; by enumerating all possible situations in which the model is not identifiable, we provide sufficient conditions for its identifiability. Simulations are given to support the theoretical results. Moreover, in the case of more than two variables, we show that the whole causal structure can be found by applying the PNL causal model to each structure in the Markov equivalence class and testing whether the disturbance is independent of the direct causes for each variable. In this way, the exhaustive search over all possible causal structures is avoided.
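The three ingredients named in the abstract above (nonlinear effect of the cause, inner noise, and measurement distortion) correspond to the two-variable PNL form x2 = f2(f1(x1) + e2), with e2 independent of x1 and f2 invertible. The following sketch generates data from that form and recovers the noise term in the causal direction. The specific functions `f1` and `f2`, the noise distribution, and all names are hypothetical choices for illustration, not those used in the paper.

```python
import math
import random


def f1(x: float) -> float:
    """Hypothetical nonlinear effect of the cause."""
    return math.tanh(2.0 * x)


def f2(z: float) -> float:
    """Hypothetical invertible measurement distortion."""
    return math.exp(z)


def f2_inverse(y: float) -> float:
    """Inverse of the distortion above."""
    return math.log(y)


def generate_pnl_pair(rng: random.Random):
    """Draw one (cause, effect, noise) triple from x2 = f2(f1(x1) + e2)."""
    x1 = rng.gauss(0.0, 1.0)      # cause
    e2 = rng.uniform(-0.5, 0.5)   # inner noise, drawn independently of x1
    x2 = f2(f1(x1) + e2)          # distorted observation of the effect
    return x1, x2, e2


rng = random.Random(0)
samples = [generate_pnl_pair(rng) for _ in range(1000)]

# In the causal direction, inverting f2 and subtracting f1(x1)
# recovers the noise term up to floating-point error.
recovered = [f2_inverse(x2) - f1(x1) for x1, x2, _ in samples]
max_error = max(abs(r - e2) for r, (_, _, e2) in zip(recovered, samples))
print(max_error)
```

In practice f1 and f2 are unknown and must be estimated, and the direction is judged by testing whether the recovered disturbance is independent of the putative cause; the sketch skips both steps and only shows the model structure.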
Causal modeling has long been an attractive topic for many researchers, and recent decades have seen a surge in theoretical development and discovery algorithms. Generally, discovery algorithms can be divided into two approaches: constraint-based and score-based. The constraint-based approach is able to detect common causes of the observed variables, but the use of independence tests makes it less reliable. The score-based approach produces a result that is easier to interpret, as it also measures the reliability of the inferred causal relationships, but it is unable to detect common confounders of the observed variables. A drawback of both score-based and constraint-based approaches is the inherent instability in structure estimation: with finite samples, small changes in the data can lead to completely different optimal structures. The present work introduces a new hypothesis-free score-based causal discovery algorithm, called stable specification search, that is robust for finite samples, building on recent advances in stability selection using subsampling and selection algorithms. Structure search is performed over Structural Equation Models. Our approach uses exploratory search but allows incorporation of prior background knowledge. We validated our approach on one simulated data set, which we compare to the known ground truth, and two real-world data sets for Chronic Fatigue Syndrome and Attention Deficit Hyperactivity Disorder, which we compare to earlier medical studies. The results on the simulated data set show significant improvement over alternative approaches, and the results on the real-world data sets show consistency with the hypothesis-driven models constructed by medical experts.
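The subsampling idea behind stability selection, mentioned in the abstract above, can be sketched as follows: run a structure selector on many random half-samples and keep only the relations that are selected in a large fraction of runs. Everything here is an assumption made for illustration. The `toy_selector` is a hypothetical stand-in that keeps strongly correlated variable pairs (it is undirected and not a causal learner), and the subsample count and frequency threshold are arbitrary. It is not the stable specification search procedure itself.

```python
import random
from collections import Counter
from itertools import combinations


def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def toy_selector(rows, cutoff=0.5):
    """Hypothetical stand-in for a structure learner:
    returns the set of variable pairs whose |correlation| exceeds cutoff."""
    cols = list(zip(*rows))
    return {
        (i, j)
        for i, j in combinations(range(len(cols)), 2)
        if abs(pearson(cols[i], cols[j])) > cutoff
    }


def stable_edges(rows, selector, n_subsamples=100, threshold=0.6, seed=0):
    """Return {edge: selection frequency} for edges selected in at least
    `threshold` of the random half-subsamples (drawn without replacement)."""
    rng = random.Random(seed)
    counts = Counter()
    half = len(rows) // 2
    for _ in range(n_subsamples):
        subsample = rng.sample(rows, half)
        counts.update(selector(subsample))
    return {
        edge: count / n_subsamples
        for edge, count in counts.items()
        if count / n_subsamples >= threshold
    }


# Synthetic data: variable 1 depends on variable 0; variable 2 is unrelated.
data_rng = random.Random(1)
rows = []
for _ in range(200):
    a = data_rng.gauss(0.0, 1.0)
    b = a + data_rng.gauss(0.0, 0.3)
    c = data_rng.gauss(0.0, 1.0)
    rows.append((a, b, c))

result = stable_edges(rows, toy_selector)
print(result)
```

Swapping `toy_selector` for a real structure learner leaves `stable_edges` unchanged, which is the appeal of the scheme: the stability filter is independent of the underlying search.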
A long-standing open research problem is how to use information from different experiments, including background knowledge, to infer causal relations. Recent developments have shown ways to use multiple data sets, provided they originate from identical experiments. We present the MCI-algorithm as the first method that can infer provably valid causal relations in the large sample limit from different experiments. It is fast, reliable and produces very clear and easily interpretable output. It is based on a result that shows that constraint-based causal discovery is decomposable into a candidate pair identification and subsequent elimination step that can be applied separately from different models. We test the algorithm on a variety of synthetic input model sets to assess its behavior and the quality of the output. The method shows promising signs that it can be adapted to suit causal discovery in real-world application areas as well, including large databases.