
Collaborating Authors

 Adams, Roy


Partial Identifiability in Discrete Data With Measurement Error

arXiv.org Machine Learning

When data contains measurement errors, it is necessary to make assumptions relating the observed, erroneous data to the unobserved true phenomena of interest. These assumptions should be justifiable on substantive grounds, but are often motivated by mathematical convenience, for the sake of exactly identifying the target of inference. We adopt the view that it is preferable to present bounds under justifiable assumptions than to pursue exact identification under dubious ones. To that end, we demonstrate how a broad class of modeling assumptions involving discrete variables, including common measurement error and conditional independence assumptions, can be expressed as linear constraints on the parameters of the model. We then use linear programming techniques to produce sharp bounds for factual and counterfactual distributions under measurement error in such models. We additionally propose a procedure for obtaining outer bounds on non-linear models. Our method yields sharp bounds in a number of important settings -- such as the instrumental variable scenario with measurement error -- for which no bounds were previously known.
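To make the "assumptions as linear constraints, bounds via linear programming" idea concrete, here is a minimal sketch in Python. It is an illustration under assumptions I am supplying (a single binary variable X observed through an error-prone proxy X_star, with assumed caps alpha on the false-positive rate and beta on the false-negative rate), not the paper's general construction; the function name and parameters are hypothetical.

# Sketch: sharp bounds on P(X = 1) when only an error-prone proxy X_star is observed.
# The joint probabilities q_{x,x*} = P(X = x, X_star = x*) are the LP variables.
import numpy as np
from scipy.optimize import linprog

def bounds_on_prevalence(p_xstar1, alpha, beta):
    # Variable order: q = [q00, q01, q10, q11].
    # Equalities: probabilities sum to 1; the observed margin P(X_star = 1) matches.
    A_eq = [[1, 1, 1, 1],
            [0, 1, 0, 1]]
    b_eq = [1.0, p_xstar1]
    # Measurement-error assumptions, rewritten as linear inequalities in q:
    # P(X_star=1 | X=0) <= alpha  <=>  (1 - alpha) * q01 - alpha * q00 <= 0
    # P(X_star=0 | X=1) <= beta   <=>  (1 - beta)  * q10 - beta  * q11 <= 0
    A_ub = [[-alpha, 1 - alpha, 0, 0],
            [0, 0, 1 - beta, -beta]]
    b_ub = [0.0, 0.0]
    target = np.array([0, 0, 1, 1])  # P(X = 1) = q10 + q11
    lo = linprog(target, A_ub, b_ub, A_eq, b_eq, bounds=[(0, 1)] * 4, method="highs")
    hi = linprog(-target, A_ub, b_ub, A_eq, b_eq, bounds=[(0, 1)] * 4, method="highs")
    return lo.fun, -hi.fun

print(bounds_on_prevalence(p_xstar1=0.3, alpha=0.05, beta=0.2))

Minimizing and maximizing the same linear target over the constraint set is what makes the resulting interval sharp for this toy model: every value inside it is attained by some distribution satisfying the stated assumptions.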


Evaluating Model Robustness to Dataset Shift

arXiv.org Machine Learning

The environments in which we deploy machine learning (ML) algorithms rarely look exactly like the environments in which we collected our training data. Unfortunately, we lack methodology for evaluating how well an algorithm will generalize to new environments that differ in a structured way from the training data (i.e., the case of dataset shift (Quiñonero-Candela et al., 2009)). Such methodology is increasingly important as ML systems are being deployed across a number of industries, such as health care and personal finance, in which system performance translates directly to real-world outcomes. Further, as regulation and product reviews become more common across industries, system developers will be expected to produce evidence of the validity and safety of their systems. For example, the United States Food and Drug Administration (FDA) currently regulates ML systems for medical applications, requiring evidence for the validity of such systems before approval is granted (US Food and Drug Administration, 2019). Evaluation methods for assessing model validity have typically focused on how the model performs on data from the training distribution, known as internal validity. Powerful tools, such as cross-validation and the bootstrap, exist for this purpose, but they rely on the assumption that the training and test data are drawn from the same distribution. As a result, these validation methods do not capture a model's ability to generalize to new environments, known as external validity (Campbell and Stanley, 1963). Currently, the main way to assess a model's external validity is to empirically evaluate performance on multiple, independently collected datasets.
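The internal/external validity distinction drawn above can be made concrete with a short sketch. This is an illustration under assumptions of my own (synthetic data standing in for a training environment and a shifted deployment environment), not the evaluation procedure proposed in the paper: cross-validation scores performance on the training distribution, while scoring on an independently collected, shifted dataset probes external validity.

# Sketch: internal validity via cross-validation vs. external validity on shifted data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Hypothetical training environment.
X_train = rng.normal(size=(1000, 5))
y_train = (X_train[:, 0] + 0.5 * X_train[:, 1] + rng.normal(size=1000) > 0).astype(int)

# Hypothetical deployment environment with a shifted covariate distribution.
X_ext = rng.normal(loc=[1.0, 0, 0, 0, 0], size=(1000, 5))
y_ext = (X_ext[:, 0] + 0.5 * X_ext[:, 1] + rng.normal(size=1000) > 0).astype(int)

model = LogisticRegression().fit(X_train, y_train)

internal = cross_val_score(LogisticRegression(), X_train, y_train, cv=5).mean()  # internal validity
external = model.score(X_ext, y_ext)                                             # external validity
print(f"internal (CV) accuracy: {internal:.3f}, external accuracy: {external:.3f}")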


Learning Models from Data with Measurement Error: Tackling Underreporting

arXiv.org Machine Learning

Measurement error in observational datasets can lead to systematic bias in inferences based on these datasets. As studies based on observational data are increasingly used to inform decisions with real-world impact, it is critical that we develop a robust set of techniques for analyzing and adjusting for these biases. In this paper we present a method for estimating the distribution of an outcome given a binary exposure that is subject to underreporting. Our method is based on a missing data view of the measurement error problem, where the true exposure is treated as a latent variable that is marginalized out of a joint model. We prove three different conditions under which the outcome distribution can still be identified from data containing only error-prone observations of the exposure. We demonstrate this method on synthetic data and analyze its sensitivity to near violations of the identifiability conditions. Finally, we use this method to estimate the effects of maternal smoking and opioid use during pregnancy on childhood obesity, two important problems in public health. Using the proposed method, we estimate these effects using only subject-reported drug use data and substantially refine the range of estimates generated by a sensitivity analysis-based approach. Further, the estimates produced by our method are consistent with existing literature on both the effects of maternal smoking and the rate at which subjects underreport smoking.
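The "latent true exposure marginalized out of a joint model" idea can be sketched as follows. This is a simplified illustration, not the paper's estimator or one of its proven identifiability conditions: it assumes a binary exposure with no false positives (underreporting only) and a known sensitivity s = P(A_star = 1 | A = 1), and fits the remaining parameters by maximum marginal likelihood.

# Sketch: marginal likelihood with the true exposure A treated as a latent variable.
import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(params, a_star, y, s):
    pi, theta0, theta1 = params                      # P(A=1), P(Y=1|A=0), P(Y=1|A=1)
    p_y_given_a1 = theta1 ** y * (1 - theta1) ** (1 - y)
    p_y_given_a0 = theta0 ** y * (1 - theta0) ** (1 - y)
    # A_star = 1 can only arise from A = 1 (underreporting: no false positives).
    lik_astar1 = pi * s * p_y_given_a1
    # A_star = 0 arises from A = 1 with the exposure unreported, or from A = 0.
    lik_astar0 = pi * (1 - s) * p_y_given_a1 + (1 - pi) * p_y_given_a0
    lik = np.where(a_star == 1, lik_astar1, lik_astar0)
    return -np.sum(np.log(lik + 1e-12))

# Synthetic example: true pi = 0.4, theta0 = 0.2, theta1 = 0.6, sensitivity 0.7.
rng = np.random.default_rng(0)
n = 20_000
a = rng.binomial(1, 0.4, n)
y = rng.binomial(1, np.where(a == 1, 0.6, 0.2))
a_star = a * rng.binomial(1, 0.7, n)                 # exposure reported only 70% of the time

res = minimize(neg_log_marginal_likelihood, x0=[0.5, 0.5, 0.5],
               args=(a_star, y, 0.7), bounds=[(1e-3, 1 - 1e-3)] * 3)
print(res.x)   # estimates of (pi, theta0, theta1) despite the underreported exposure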