Watts, Duncan J.
Pre-registration for Predictive Modeling
Hofman, Jake M., Chatzimparmpas, Angelos, Sharma, Amit, Watts, Duncan J., Hullman, Jessica
Several scientific communities are currently facing a replication crisis, in which researchers have found it difficult or impossible to independently verify the results of previously published studies. Failures to replicate large swaths of experimental work (Camerer et al., 2018; Nosek et al., 2015; Begley and Ellis, 2012; Baker, 2016) have occurred in fields such as psychology and medicine that focus on what Hofman et al. (2021) call explanatory modeling, where the goal is to identify and estimate causal effects (e.g., is there an effect of X on Y, and if so, how large is it?). While many different factors can contribute to unreliable findings in explanatory modeling, the combination of small-scale experiments involving noisy measurements and the (mis)use of null hypothesis significance testing (NHST) has received a great deal of attention in recent years. Under these conditions, researchers can mistake idiosyncratic patterns in noise for true effects, resulting in unreliable findings that do not replicate upon further investigation (Button et al., 2013; Loken and Gelman, 2017; Meehl, 1990; Simmons et al., 2011). More generally, some forms of data-dependent decision making (e.g., about how to define research questions or hypotheses, how to filter or transform data, how to model data, what tests to run, etc.) can lead to similar problems regardless of the specific methods used (Gelman and Loken, 2013).

What about other fields, such as machine learning and data science, that focus less on explanation and more on predictive modeling, defined in Hofman et al. (2021) as directly forecasting outcomes (e.g., how well can an outcome Y be predicted using all available features X?) without necessarily isolating individual causal effects? Predictive modeling is typically done by testing (out-of-sample) predictions on large-scale datasets, and hence, unlike explanatory modeling, involves neither small experiments nor misuse of significance testing.
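The failure mode described above can be illustrated with a small simulation (a hypothetical sketch, not from the paper): when the true effect is zero, running underpowered experiments and exercising even one data-dependent choice, here reporting the better of two correlated outcome measures, inflates the false-positive rate above the nominal 5% level.

```python
import random
import math

random.seed(0)

def two_sample_p(x, y):
    """Approximate two-sided p-value for a difference in means
    (normal approximation to the t-test; fine for illustration)."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    z = (mx - my) / math.sqrt(vx / nx + vy / ny)
    return math.erfc(abs(z) / math.sqrt(2))  # = 2 * (1 - Phi(|z|))

def false_positive_rate(n=20, trials=5000, alpha=0.05):
    """Fraction of null experiments declared 'significant' when the
    researcher tests two correlated outcomes and reports the smaller p."""
    hits = 0
    for _ in range(trials):
        control = [random.gauss(0, 1) for _ in range(n)]
        treated = [random.gauss(0, 1) for _ in range(n)]  # true effect is zero
        # data-dependent decision: a second, closely related outcome measure
        treated2 = [v + random.gauss(0, 0.5) for v in treated]
        p = min(two_sample_p(treated, control), two_sample_p(treated2, control))
        hits += p < alpha
    return hits / trials

rate = false_positive_rate()
print(rate)
```

With these settings the simulated false-positive rate typically comes out well above the nominal 0.05, even though no true effect exists; adding more researcher degrees of freedom (more outcomes, more subgroup cuts) inflates it further.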
With advances in statistics and machine learning (ML), we have seen remarkable performance gains in predictive modeling over the last decade, both for traditional ML tasks and for scientific applications. The same methods that have been shown to perform at or above human level on tasks like playing chess, classifying images, or understanding natural language (Zhang et al.,
Split-door criterion: Identification of causal effects through auxiliary outcomes
Sharma, Amit, Hofman, Jake M., Watts, Duncan J.
We present a method for estimating causal effects in time series data when fine-grained information about the outcome of interest is available. Specifically, we examine what we call the split-door setting, where the outcome variable can be split into two parts: one that is potentially affected by the cause being studied and another that is independent of it, with both parts sharing the same (unobserved) confounders. We show that under these conditions, the problem of identification reduces to that of testing for independence among observed variables, and present a method that uses this approach to automatically find subsets of the data that are causally identified. We demonstrate the method by estimating the causal impact of Amazon's recommender system on traffic to product pages, finding thousands of examples within the dataset that satisfy the split-door criterion. Unlike past studies based on natural experiments that were limited to a single product category, our method applies to a large and representative sample of products viewed on the site. In line with previous work, we find that the widely used click-through rate (CTR) metric overestimates the causal impact of recommender systems; depending on the product category, we estimate that 50-80% of the traffic attributed to recommender systems would have happened even without any recommendations. We conclude with guidelines for using the split-door criterion as well as a discussion of other contexts where the method can be applied.
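As a rough illustration of the identification logic (hypothetical code, not the authors' implementation; the variable names and the permutation-based independence check are assumptions for the sketch): treat the focal product's traffic x, the recommender click-throughs y_rec, and the recommendation-free direct visits y_direct as time series, and keep a product pair only if x and y_direct look statistically independent.

```python
import random
import statistics

def pearson(x, y):
    """Sample Pearson correlation between two equal-length series."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den if den else 0.0

def looks_independent(x, y_direct, n_perm=500, level=0.95, seed=0):
    """Permutation test: is the observed |correlation| between the focal
    product's traffic and the direct traffic to the recommended product
    small relative to correlations under random shuffling?"""
    rng = random.Random(seed)
    obs = abs(pearson(x, y_direct))
    perm = []
    for _ in range(n_perm):
        shuffled = y_direct[:]
        rng.shuffle(shuffled)
        perm.append(abs(pearson(x, shuffled)))
    perm.sort()
    return obs <= perm[int(level * n_perm)]

def split_door_effect(x, y_rec, y_direct):
    """If the split-door check passes, attribute y_rec (recommender
    click-throughs) to x and return an estimated per-visit effect;
    otherwise the pair is not causally identified and None is returned."""
    if not looks_independent(x, y_direct):
        return None
    return sum(y_rec) / sum(x)
```

The key design point is that the check uses only observed variables: if an unobserved demand shock drove both x and the recommended product's traffic, it would show up as dependence between x and y_direct, and the pair would be discarded rather than mis-attributed to the recommender.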