Zeng, Zhenghao
Causal Inference for Genomic Data with Multiple Heterogeneous Outcomes
Du, Jin-Hong, Zeng, Zhenghao, Kennedy, Edward H., Wasserman, Larry, Roeder, Kathryn
With the evolution of single-cell RNA sequencing techniques into a standard approach in genomics, it has become possible to conduct cohort-level causal inferences based on single-cell-level measurements. However, the individual gene expression levels of interest are not directly observable; instead, only repeated proxy measurements from each individual's cells are available, providing a derived outcome to estimate the underlying outcome for each of many genes. In this paper, we propose a generic semiparametric inference framework for doubly robust estimation with multiple derived outcomes, which also encompasses the usual setting of multiple outcomes when the response of each unit is available. To reliably quantify the causal effects of heterogeneous outcomes, we specialize the analysis to standardized average treatment effects and quantile treatment effects. Through this, we demonstrate the use of the semiparametric inferential results for doubly robust estimators derived from both Von Mises expansions and estimating equations. A multiple testing procedure based on Gaussian multiplier bootstrap is tailored for doubly robust estimators to control the false discovery exceedance rate. Applications in single-cell CRISPR perturbation analysis and individual-level differential expression analysis demonstrate the utility of the proposed methods and offer insights into the usage of different estimands for causal inference in genomics.
Continuous Treatment Effects with Surrogate Outcomes
Zeng, Zhenghao, Arbour, David, Feller, Avi, Addanki, Raghavendra, Rossi, Ryan, Sinha, Ritwik, Kennedy, Edward H.
In many causal inference applications, the primary outcomes are missing for a non-trivial number of observations. For instance, in studies on long-term health effects of medical interventions, some measurements require expensive testing and a loss to follow-up is common (Hogan et al., 2004). In evaluating commercial online ad effectiveness, some individuals may drop out from the panel because they use multiple devices (Shankar et al., 2023), leading to missing revenue measures. In many of these studies, however, there often exist short-term outcomes that are easier and faster to measure, e.g., short-term health measures or an online ad's click-through rate, that are observed for a greater share of the sample. These outcomes, which are typically informative about the primary outcomes themselves, are refered to as surrogate outcomes or surrogates. There is a rich causal inference literature addressing missing outcome data. Simply restricting to data with observed primary outcomes may induce strong bias (Hernán and Robins, 2010). Ignoring unlabeled data also reduces the effective sample size for estimating the treatment effects and inflates the variance. Chakrabortty et al. (2022) considered the missing completely at random (MCAR) setting and showed that incorporating unlabeled data reduces variance.