Leveraging text data for causal inference using electronic health records
Mozer, Reagan, Kaufman, Aaron R., Celi, Leo A., Miratrix, Luke
arXiv.org Artificial Intelligence
The field of medicine has been revolutionized in the last decade by the advent of large clinical databases (Jensen et al., 2012; Friedman et al., 2013; Evans, 2016; Cowie et al., 2017). Electronic health records, administrative databases, and online registries hold a wealth of information with the potential to help answer long-standing questions across all facets of health care, from designing more effective treatment regimes (Komorowski et al., 2018) to tailoring treatment to individuals based on their characteristics (Abul-Husn and Kenny, 2019). A primary challenge in this domain is how best to harness that information, much of which is unstructured data such as text or complex data such as X-ray images (Koleck et al., 2019; Huang et al., 2020; Tayefi et al., 2021). Secondary analysis of even the lowest-hanging fruit from de-identified hospital records has yielded path-breaking insights into the effectiveness of medical interventions, both overturning received wisdom and improving patient care (Critical Data, 2016). These insights rest on a century of research in statistics, econometrics, and causal inference (Kleinberg and Hripcsak, 2011; Hernán et al., 2019) focused on gleaning valid cause-and-effect relationships from observational data using analysis strategies such as interrupted time series, regression discontinuity, and matching. At the same time, a literature in the computational social sciences has opened new doors to quantify, analyze, and rigorously interpret unstructured data.
Unsupervised techniques like Latent Dirichlet Allocation (LDA; Blei et al., 2003) and the Structural Topic Model (STM; Roberts et al., 2014) have enabled researchers with no a priori hypotheses to easily measure the composition of text corpora, while supervised methods like Convolutional Neural Nets (CNN; Albawi et al., 2017) have allowed for the scalable and automatic production of document-level outcomes or covariates (Grimmer et al., 2021), especially when traditional forms of quantitative data are lacking. Recent work (e.g., Roberts et al., 2020; Mozer et al., 2020; Egami et al., 2022) has formalized notation and begun to introduce best practices for incorporating unstructured data such as text into observational studies for improving causal inference, showing how text can fruitfully improve causal estimates in domains such as internet censorship and media
Jun-9-2023
- Country:
- North America > United States
- New York (0.04)
- Massachusetts > Suffolk County
- Boston (0.04)
- Asia
- Singapore (0.04)
- Middle East
- Jordan (0.04)
- UAE > Abu Dhabi Emirate
- Abu Dhabi (0.04)
- Genre:
- Research Report
- New Finding (1.00)
- Experimental Study (1.00)