Ilya Shpitser
Identification and Estimation of Causal Effects from Dependent Data
Eli Sherman, Ilya Shpitser
The assumption that data samples are independent and identically distributed (iid) is standard in many areas of statistics and machine learning. Nevertheless, in some settings, such as social networks, infectious disease modeling, and reasoning with spatial and temporal data, this assumption is false. An extensive literature exists on making causal inferences under the iid assumption [17, 11, 26, 21], even when unobserved confounding bias may be present. But, as pointed out in [19], causal inference in non-iid contexts is challenging due to the presence of both unobserved confounding and data dependence. In this paper we develop a general theory describing when causal inferences are possible in such scenarios. We use segregated graphs [20], a generalization of latent projection mixed graphs [28], to represent causal models of this type and provide a complete algorithm for nonparametric identification in these models. We then demonstrate how statistical inference may be performed on causal parameters identified by this algorithm. In particular, we consider cases where only a single sample is available for parts of the model due to full interference, i.e., all units are pathwise dependent and neighbors' treatments affect each others' outcomes [24]. We apply these techniques to a synthetic data set which considers users sharing fake news articles given the structure of their social network, user activity levels, and baseline demographics and socioeconomic covariates.
Identification and Estimation of Causal Effects from Dependent Data
Eli Sherman, Ilya Shpitser
The assumption that data samples are independent and identically distributed (iid) is standard in many areas of statistics and machine learning. Nevertheless, in some settings, such as social networks, infectious disease modeling, and reasoning with spatial and temporal data, this assumption is false. An extensive literature exists on making causal inferences under the iid assumption [17, 11, 26, 21], even when unobserved confounding bias may be present. But, as pointed out in [19], causal inference in non-iid contexts is challenging due to the presence of both unobserved confounding and data dependence. In this paper we develop a general theory describing when causal inferences are possible in such scenarios. We use segregated graphs [20], a generalization of latent projection mixed graphs [28], to represent causal models of this type and provide a complete algorithm for nonparametric identification in these models. We then demonstrate how statistical inference may be performed on causal parameters identified by this algorithm. In particular, we consider cases where only a single sample is available for parts of the model due to full interference, i.e., all units are pathwise dependent and neighbors' treatments affect each others' outcomes [24]. We apply these techniques to a synthetic data set which considers users sharing fake news articles given the structure of their social network, user activity levels, and baseline demographics and socioeconomic covariates.
Consistent Estimation of Functions of Data Missing Non-Monotonically and Not at Random
Ilya Shpitser
Missing records are a perennial problem in analysis of complex data of all types, when the target of inference is some function of the full data law. In simple cases, where data is missing at random or completely at random [15], well-known adjustments exist that result in consistent estimators of target quantities. Assumptions underlying these estimators are generally not realistic in practical missing data problems. Unfortunately, consistent estimators in more complex cases where data is missing not at random, and where no ordering on variables induces monotonicity of missingness status are not known in general, with some notable exceptions [13, 18, 16]. In this paper, we propose a general class of consistent estimators for cases where data is missing not at random, and missingness status is non-monotonic. Our estimators, which are generalized inverse probability weighting estimators, make no assumptions on the underlying full data law, but instead place independence restrictions, and certain other fairly mild assumptions, on the distribution of missingness status conditional on the data. The assumptions we place on the distribution of missingness status conditional on the data can be viewed as a version of a conditional Markov random field (MRF) corresponding to a chain graph. Assumptions embedded in our model permit identification from the observed data law, and admit a natural fitting procedure based on the pseudo likelihood approach of [2]. We illustrate our approach with a simple simulation study, and an analysis of risk of premature birth in women in Botswana exposed to highly active anti-retroviral therapy.