Peters, Jonas
Learning by Doing: Controlling a Dynamical System using Causality, Control, and Reinforcement Learning
Weichwald, Sebastian, Mogensen, Søren Wengel, Lee, Tabitha Edith, Baumann, Dominik, Kroemer, Oliver, Guyon, Isabelle, Trimpe, Sebastian, Peters, Jonas, Pfister, Niklas
Questions in causality, control, and reinforcement learning go beyond the classical machine learning task of prediction under i.i.d. observations. Instead, these fields consider the problem of learning how to actively perturb a system to achieve a certain effect on a response variable. Arguably, they have complementary views on the problem: In control, one usually aims to first identify the system by excitation strategies to then apply model-based design techniques to control the system. In (non-model-based) reinforcement learning, one directly optimizes a reward. In causality, one focus is on identifiability of causal structure. We believe that combining the different views might create synergies and this competition is meant as a first step toward such synergies. The participants had access to observational and (offline) interventional data generated by dynamical systems. Track CHEM considers an open-loop problem in which a single impulse at the beginning of the dynamics can be set, while Track ROBO considers a closed-loop problem in which control variables can be set at each time step. The goal in both tracks is to infer controls that drive the system to a desired state. Code is open-sourced (https://github.com/LearningByDoingCompetition/learningbydoing-comp) to reproduce the winning solutions of the competition and to facilitate trying out new methods on the competition tasks.
Exploiting Independent Instruments: Identification and Distribution Generalization
Saengkyongam, Sorawit, Henckel, Leonard, Pfister, Niklas, Peters, Jonas
When estimating the causal function between a vector of covariates X and a response Y in the presence of unobserved confounding, standard regression procedures such as ordinary least squares (OLS) are even asymptotically biased. Instrumental variable approaches (Wright, 1928; Imbens and Angrist, 1994; Newey, 2013) exploit the existence of exogenous heterogeneity in the form of an instrumental variable (IV) Z and estimate, under suitable conditions, the causal function consistently. Importantly, the errors in Y and the hidden confounders U should be uncorrelated with the instruments Z. Usually, this has to be argued for with background knowledge. When the data generating process is modeled by a structural causal model (SCM) (Pearl, 2009; Bongers et al., 2021) (so that the distribution is Markov with respect to the induced graph), then the above condition is satisfied if Y and U are d-separated from Z in the graph obtained by removing the edge from X to Y. Furthermore, in this case the errors in Y and U are even independent of Z. Exploiting the fact that the errors and instruments are not only uncorrelated but also independent comes with several benefits. For example, even in settings where the causal function can be identified by classical approaches based on uncorrelatedness, the independence can be exploited to construct estimators that achieve the semiparametric efficiency bound, at least when the error distribution comes from a known, parametric family (Hansen et al., 2010). Furthermore, the independence constraint is stronger than uncorrelatedness and therefore yields stronger identifiability results, which has been reported in the field of econometrics (e.g., Imbens and Newey, 2009; Chesher, 2003).
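The contrast between OLS and an instrumental variable estimator described above can be illustrated with a minimal simulation sketch. The linear model, the coefficient values, and the function names (`simulate`, `ols_slope`, `iv_slope`) are assumptions made for illustration only and are not taken from the paper.

```python
import numpy as np

def simulate(n, beta=2.0, seed=0):
    """Draw from a hypothetical linear SCM with a hidden confounder U.

    Z -> X -> Y, with U -> X and U -> Y. Z is independent of U and of the errors.
    """
    rng = np.random.default_rng(seed)
    z = rng.normal(size=n)                        # instrument
    u = rng.normal(size=n)                        # unobserved confounder
    x = z + u + rng.normal(size=n)                # covariate
    y = beta * x + 2.0 * u + rng.normal(size=n)   # response, causal slope = beta
    return z, x, y

def ols_slope(x, y):
    """Least squares slope without intercept (all variables have mean zero)."""
    return float(np.dot(x, y) / np.dot(x, x))

def iv_slope(z, x, y):
    """Simple IV estimate: ratio of the sample moments of (Z, Y) and (Z, X)."""
    return float(np.dot(z, y) / np.dot(z, x))

z, x, y = simulate(20000)
print("OLS:", ols_slope(x, y))    # stays away from beta even for large n
print("IV: ", iv_slope(z, x, y))  # close to beta
```

In this toy model the population OLS slope equals beta + 2/3, because Cov(X, U) is nonzero, whereas the IV ratio targets beta. The sketch only uses uncorrelatedness of Z and the errors; it does not exploit the stronger independence property discussed in the abstract.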
Structure Learning for Directed Trees
Jakobsen, Martin Emil, Shah, Rajen D., Bühlmann, Peter, Peters, Jonas
Knowing the causal structure of a system is of fundamental interest in many areas of science and can aid the design of prediction algorithms that work well under manipulations to the system. The causal structure becomes identifiable from the observational distribution under certain restrictions. To learn the structure from data, score-based methods evaluate different graphs according to the quality of their fits. However, for large nonlinear models, these rely on heuristic optimization approaches with no general guarantees of recovering the true causal structure. In this paper, we consider structure learning of directed trees. We propose a fast and scalable method, based on the Chu-Liu-Edmonds algorithm, which we call causal additive trees (CAT). For the case of Gaussian errors, we prove consistency in an asymptotic regime with a vanishing identifiability gap. We also introduce a method for testing substructure hypotheses with asymptotic family-wise error rate control that is valid post-selection and in unidentified settings. Furthermore, we study the identifiability gap, which quantifies how much better the true causal model fits the observational distribution, and prove that it is lower bounded by local properties of the causal model. Simulation studies demonstrate the favorable performance of CAT compared to competing structure learning methods.
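As a rough illustration of the combinatorial step, the following sketch implements a recursive version of the Chu-Liu-Edmonds algorithm for a minimum-weight spanning arborescence. It assumes a fixed, known root and that every node is reachable from it. The edge weights in the usage example are hypothetical; the abstract does not state how CAT computes its edge scores, so this is not the authors' implementation.

```python
def cheapest_arborescence(weights, root):
    """Edges of a minimum-weight spanning arborescence rooted at `root`.

    weights: dict mapping a directed edge (u, v) to its cost.
    """
    # Step 1: pick the cheapest incoming edge for every non-root node.
    parent = {}
    for (u, v), w in weights.items():
        if v != root and u != v and (v not in parent or w < weights[(parent[v], v)]):
            parent[v] = u
    # Step 2: search for a cycle among the selected edges.
    cycle = None
    for start in parent:
        path, v = [], start
        while v in parent and v not in path:
            path.append(v)
            v = parent[v]
        if v in path:
            cycle = path[path.index(v):]
            break
    if cycle is None:
        return {(u, v) for v, u in parent.items()}
    # Step 3: contract the cycle into one super node and adjust entering costs.
    cyc = set(cycle)
    super_node = ("contracted", tuple(cycle))
    new_weights, origin = {}, {}
    for (u, v), w in weights.items():
        if u in cyc and v in cyc:
            continue
        if v in cyc:
            key, cost = (u, super_node), w - weights[(parent[v], v)]
        elif u in cyc:
            key, cost = (super_node, v), w
        else:
            key, cost = (u, v), w
        if key not in new_weights or cost < new_weights[key]:
            new_weights[key], origin[key] = cost, (u, v)
    sub_tree = cheapest_arborescence(new_weights, root)
    # Step 4: expand the super node; drop the cycle edge into the entry node.
    tree = {origin[e] for e in sub_tree}
    entering = next(v for (u, v) in tree if v in cyc and u not in cyc)
    tree |= {(parent[v], v) for v in cycle if v != entering}
    return tree

# Hypothetical edge costs on four nodes with root 0.
costs = {(0, 1): 5, (0, 2): 10, (1, 2): 1, (2, 1): 1,
         (2, 3): 2, (1, 3): 4, (0, 3): 9}
print(sorted(cheapest_arborescence(costs, 0)))
```

In the example, the greedy choice of cheapest incoming edges first produces the cycle between nodes 1 and 2, which is contracted and then broken at the node where the cheapest outside edge enters.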
Invariant Policy Learning: A Causal Perspective
Saengkyongam, Sorawit, Thams, Nikolaj, Peters, Jonas, Pfister, Niklas
In the past decade, contextual bandit and reinforcement learning algorithms have been successfully used in various interactive learning systems such as online advertising, recommender systems, and dynamic pricing. However, they have yet to be widely adopted in high-stakes application domains, such as healthcare. One reason may be that existing approaches assume that the underlying mechanisms are static in the sense that they do not change over time or over different environments. In many real-world systems, however, the mechanisms are subject to shifts across environments which may invalidate the static environment assumption. In this paper, we tackle the problem of environmental shifts under the framework of offline contextual bandits. We view the environmental shift problem through the lens of causality and propose multi-environment contextual bandits that allow for changes in the underlying mechanisms. We adopt the concept of invariance from the causality literature and introduce the notion of policy invariance. We argue that policy invariance is only relevant if unobserved confounders are present and show that, in that case, an optimal invariant policy is guaranteed, under certain assumptions, to generalize across environments. Our results not only provide a solution to the environmental shift problem but also establish concrete connections among causality, invariance and contextual bandits.
Regularizing towards Causal Invariance: Linear Models with Proxies
Oberst, Michael, Thams, Nikolaj, Peters, Jonas, Sontag, David
We propose a method for learning linear models whose predictive performance is robust to causal interventions on unobserved variables, when noisy proxies of those variables are available. Our approach takes the form of a regularization term that trades off between in-distribution performance and robustness to interventions. Under the assumption of a linear structural causal model, we show that a single proxy can be used to create estimators that are prediction optimal under interventions of bounded strength. This strength depends on the magnitude of the measurement noise in the proxy, which is, in general, not identifiable. In the case of two proxy variables, we propose a modified estimator that is prediction optimal under interventions up to a known strength. We further show how to extend these estimators to scenarios where additional information about the "test time" intervention is available during training. We evaluate our theoretical findings in synthetic experiments and using real data of hourly pollution levels across several cities in China.
Distributional Robustness of K-class Estimators and the PULSE
Jakobsen, Martin Emil, Peters, Jonas
Recently, in causal discovery, invariance properties such as the moment criterion that two-stage least squares estimators leverage have been exploited for causal structure learning: e.g., in cases where the causal parameter is not identifiable, some structure of the non-zero components may be identified, and coverage guarantees are available. Subsequently, anchor regression has been proposed to trade off invariance and predictability. The resulting estimator is shown to have optimal predictive performance under bounded shift interventions. In this paper, we show that the concepts of anchor regression and K-class estimators are closely related. Establishing this connection comes with two benefits: (1) it enables us to prove robustness properties for existing K-class estimators when considering distributional shifts, and (2) it allows us to propose a novel estimator in instrumental variable settings by minimizing the mean squared prediction error subject to the constraint that the estimator lies in an asymptotically valid confidence region of the causal parameter. We call this estimator PULSE (p-uncorrelated least squares estimator) and show that it can be computed efficiently, even though the underlying optimization problem is non-convex. We further prove that it is consistent. We perform simulation experiments illustrating that there are several settings, including weak instrument settings, in which PULSE outperforms other estimators and suffers from less variability.
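For readers unfamiliar with K-class estimators, the following sketch shows the textbook form, which interpolates between OLS (kappa = 0) and two-stage least squares (kappa = 1). The simulated data, the helper `simulate_iv_data`, and the absence of included exogenous covariates are simplifying assumptions for illustration. The sketch does not implement PULSE or the data-driven choice of kappa described in the abstract.

```python
import numpy as np

def k_class(X, y, Z, kappa):
    """K-class estimate: solve X'(I - kappa*M_Z)X b = X'(I - kappa*M_Z)y.

    M_Z = I - P_Z, where P_Z projects onto the column space of Z,
    so I - kappa*M_Z = (1 - kappa)*I + kappa*P_Z.
    """
    PX = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]   # P_Z X
    Py = Z @ np.linalg.lstsq(Z, y, rcond=None)[0]   # P_Z y
    A = (1.0 - kappa) * (X.T @ X) + kappa * (X.T @ PX)
    b = (1.0 - kappa) * (X.T @ y) + kappa * (X.T @ Py)
    return np.linalg.solve(A, b)

def simulate_iv_data(n, beta=1.0, seed=0):
    """Hypothetical confounded linear model with one instrument."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=n)
    u = rng.normal(size=n)                  # hidden confounder
    x = z + u + rng.normal(size=n)
    y = beta * x + u + rng.normal(size=n)
    return x[:, None], y, z[:, None]

X, y, Z = simulate_iv_data(5000)
for kappa in (0.0, 0.5, 1.0):
    print(kappa, k_class(X, y, Z, kappa))
```

With a single regressor, intermediate values of kappa in [0, 1] give a weighted average of the OLS and two-stage least squares estimates, which is one way to see the trade-off between predictability and invariance.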
Theoretical Aspects of Cyclic Structural Causal Models
Bongers, Stephan, Peters, Jonas, Schölkopf, Bernhard, Mooij, Joris M.
Structural causal models (SCMs), also known as (non-parametric) structural equation models (SEMs), are widely used for causal modeling purposes. A large body of theoretical results is available for the special case in which cycles are absent (i.e., acyclic SCMs, also known as recursive SEMs). However, in many application domains cycles are abundantly present, for example in the form of feedback loops. In this paper, we provide a general and rigorous theory of cyclic SCMs. The paper consists of two parts: the first part gives a rigorous treatment of structural causal models, dealing with measure-theoretic and other complications that arise in the presence of cycles. In contrast with the acyclic case, in cyclic SCMs solutions may no longer exist, or if they exist, they may no longer be unique, or even measurable in general. We give several sufficient and necessary conditions for the existence of (unique) measurable solutions. We show how causal reasoning proceeds in these models and how this differs from the acyclic case. Moreover, we give an overview of the Markov properties that hold for cyclic SCMs. In the second part, we address the question of how one can marginalize an SCM (possibly with cycles) to a subset of the endogenous variables. We show that under a certain condition, one can effectively remove a subset of the endogenous variables from the model, leading to a more parsimonious marginal SCM that preserves the causal and counterfactual semantics of the original SCM on the remaining variables. Moreover, we show how the marginalization relates to the latent projection and to latent confounders, i.e. latent common causes.
Kernel-based Tests for Joint Independence
Pfister, Niklas, Bühlmann, Peter, Schölkopf, Bernhard, Peters, Jonas
We investigate the problem of testing whether $d$ random variables, which may or may not be continuous, are jointly (or mutually) independent. Our method builds on ideas of the two-variable Hilbert-Schmidt independence criterion (HSIC) but allows for an arbitrary number of variables. We embed the $d$-dimensional joint distribution and the product of the marginals into a reproducing kernel Hilbert space and define the $d$-variable Hilbert-Schmidt independence criterion (dHSIC) as the squared distance between the embeddings. In the population case, the value of dHSIC is zero if and only if the $d$ variables are jointly independent, as long as the kernel is characteristic. Based on an empirical estimate of dHSIC, we define three different non-parametric hypothesis tests: a permutation test, a bootstrap test and a test based on a Gamma approximation. We prove that the permutation test achieves the significance level and that the bootstrap test achieves pointwise asymptotic significance level as well as pointwise asymptotic consistency (i.e., it is able to detect any type of fixed dependence in the large sample limit). The Gamma approximation does not come with these guarantees; however, it is computationally very fast and for small $d$, it performs well in practice. Finally, we apply the test to a problem in causal discovery.
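The construction above can be sketched in a few lines: the empirical squared distance between the embedded joint distribution and the embedded product of marginals reduces to three terms built from the Gram matrices. The code below is a minimal sketch that assumes Gaussian kernels with a median-heuristic bandwidth and a Monte Carlo permutation test. The function names and the bandwidth rule are illustrative choices, not a reference implementation.

```python
import numpy as np

def gaussian_gram(x, bandwidth=None):
    """Gaussian kernel Gram matrix; bandwidth defaults to a median heuristic."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    sq = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    if bandwidth is None:
        bandwidth = np.sqrt(0.5 * np.median(sq[sq > 0]))
    return np.exp(-sq / (2.0 * bandwidth ** 2))

def dhsic(grams):
    """V-statistic estimate of dHSIC from a list of d Gram matrices."""
    n = grams[0].shape[0]
    joint = np.ones((n, n))   # elementwise product of all Gram matrices
    marg = 1.0                # product of the overall Gram means
    cross = np.ones(n)        # product of the row means
    for K in grams:
        joint *= K
        marg *= K.mean()
        cross *= K.mean(axis=1)
    return float(joint.mean() + marg - 2.0 * cross.mean())

def dhsic_permutation_test(samples, n_perm=200, seed=0):
    """Permute all but the first variable to mimic joint independence."""
    rng = np.random.default_rng(seed)
    grams = [gaussian_gram(s) for s in samples]
    observed = dhsic(grams)
    n = grams[0].shape[0]
    exceed = 0
    for _ in range(n_perm):
        permuted = [grams[0]]
        for K in grams[1:]:
            idx = rng.permutation(n)
            permuted.append(K[np.ix_(idx, idx)])
        exceed += int(dhsic(permuted) >= observed)
    return observed, (1 + exceed) / (1 + n_perm)

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = x + 0.1 * rng.normal(size=100)   # strongly dependent on x
z = rng.normal(size=100)             # independent of (x, y)
print(dhsic_permutation_test([x, y, z]))
```

Permuting the Gram matrices directly avoids recomputing kernels for every resample. The bootstrap and Gamma-approximation tests mentioned in the abstract are not covered by this sketch.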
BACKSHIFT: Learning causal cyclic graphs from unknown shift interventions
Rothenhäusler, Dominik, Heinze, Christina, Peters, Jonas, Meinshausen, Nicolai
We propose a simple method to learn linear causal cyclic models in the presence of latent variables. The method relies on equilibrium data of the model recorded under a specific kind of intervention ("shift interventions"). The location and strength of these interventions do not have to be known and can be estimated from the data. Our method, called BACKSHIFT, only uses second moments of the data and performs simple joint matrix diagonalization, applied to differences between covariance matrices. We give a sufficient and necessary condition for identifiability of the system, which is fulfilled almost surely under some quite general assumptions if and only if there are at least three distinct experimental settings, one of which can be purely observational data. We demonstrate the performance on some simulated data and applications in flow cytometry and financial time series.
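The role of covariance differences can be illustrated with a small numerical check. The sketch assumes the equilibrium model x = Bx + c + e, where the shift c has uncorrelated components and is uncorrelated with the (possibly confounded) noise e. This formalization is an assumption here, since the abstract does not spell it out. Under it, (I - B) applied to the difference of two environment covariances yields a diagonal matrix. The sketch uses population covariances with a known B and does not perform the joint diagonalization that estimates B.

```python
import numpy as np

def population_cov(B, cov_noise, shift_var):
    """Covariance of x = (I - B)^{-1}(c + e) with Cov(c) = diag(shift_var)."""
    inv = np.linalg.inv(np.eye(len(B)) - B)
    return inv @ (cov_noise + np.diag(shift_var)) @ inv.T

def transformed_difference(B, cov_env, cov_ref):
    """(I - B)(cov_env - cov_ref)(I - B)^T; diagonal under the assumed model."""
    M = np.eye(len(B)) - B
    return M @ (cov_env - cov_ref) @ M.T

# Hypothetical cyclic system: x1 -> x3 -> x2 -> x1.
B = np.array([[0.0, 0.5, 0.0],
              [0.0, 0.0, -0.4],
              [0.3, 0.0, 0.0]])
A = np.array([[1.0, 0.3, 0.0],
              [0.0, 1.0, 0.5],
              [0.2, 0.0, 1.0]])
cov_noise = A @ A.T                      # non-diagonal: latent confounding

cov_obs = population_cov(B, cov_noise, [0.0, 0.0, 0.0])   # observational
cov_env = population_cov(B, cov_noise, [2.0, 0.5, 1.0])   # shifted setting

print(np.round(transformed_difference(B, cov_env, cov_obs), 6))
```

The noise covariance cancels in the difference, which is why latent confounding does not disturb the diagonal structure in this sketch.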
Distinguishing cause from effect using observational data: methods and benchmarks
Mooij, Joris M., Peters, Jonas, Janzing, Dominik, Zscheischler, Jakob, Schölkopf, Bernhard
The discovery of causal relationships from purely observational data is a fundamental problem in science. The most elementary form of such a causal discovery problem is to decide whether X causes Y or, alternatively, Y causes X, given joint observations of two variables X, Y. An example is to decide whether altitude causes temperature, or vice versa, given only joint measurements of both variables. Even under the simplifying assumptions of no confounding, no feedback loops, and no selection bias, such bivariate causal discovery problems are challenging. Nevertheless, several approaches for addressing those problems have been proposed in recent years. We review two families of such methods: Additive Noise Methods (ANM) and Information Geometric Causal Inference (IGCI). We present the benchmark CauseEffectPairs that consists of data for 100 different cause-effect pairs selected from 37 datasets from various domains (e.g., meteorology, biology, medicine, engineering, economy, etc.) and motivate our decisions regarding the "ground truth" causal directions of all pairs. We evaluate the performance of several bivariate causal discovery methods on these real-world benchmark data and in addition on artificially simulated data. Our empirical results on real-world data indicate that certain methods are indeed able to distinguish cause from effect using only purely observational data, although more benchmark data would be needed to obtain statistically significant conclusions. One of the best performing methods overall is the additive-noise method originally proposed by Hoyer et al. (2009), which obtains an accuracy of 63 ± 10% and an AUC of 0.74 ± 0.05 on the real-world benchmark. As the main theoretical contribution of this work we prove the consistency of that method.
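The idea behind additive noise methods can be sketched as follows: regress each variable on the other and prefer the direction in which the residuals look more independent of the input. The sketch below swaps in cubic polynomial regression and a biased HSIC statistic with unit-bandwidth Gaussian kernels on standardized data. These are illustrative choices and are not claimed to match the regression or dependence measure used in the paper, and the toy data are hypothetical.

```python
import numpy as np

def hsic(a, b):
    """Biased HSIC estimate with Gaussian kernels on standardized inputs."""
    n = len(a)

    def gram(v):
        v = (v - v.mean()) / v.std()
        return np.exp(-0.5 * (v[:, None] - v[None, :]) ** 2)

    H = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    return float(np.trace(gram(a) @ H @ gram(b) @ H) / n ** 2)

def residual_dependence(cause, effect, degree=3):
    """Fit effect = f(cause) + residual and score residual-input dependence."""
    coeffs = np.polyfit(cause, effect, degree)
    residuals = effect - np.polyval(coeffs, cause)
    return hsic(cause, residuals)

def anm_direction(x, y, degree=3):
    """Return the preferred direction and both dependence scores."""
    forward = residual_dependence(x, y, degree)
    backward = residual_dependence(y, x, degree)
    label = "X->Y" if forward < backward else "Y->X"
    return label, forward, backward

rng = np.random.default_rng(0)
x = rng.uniform(-2.0, 2.0, size=300)
y = x ** 3 + rng.normal(size=300)            # additive noise model X -> Y
print(anm_direction(x, y))
```

Comparing raw dependence scores is only one possible decision rule; a test-based variant would compare p-values of an independence test instead.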