Leveraging text data for causal inference using electronic health records

Mozer, Reagan, Kaufman, Aaron R., Celi, Leo A., Miratrix, Luke

arXiv.org Artificial Intelligence 

The field of medicine has been revolutionized in the last decade by the advent of large clinical databases (Jensen et al., 2012; Friedman et al., 2013; Evans, 2016; Cowie et al., 2017). Electronic health records, administrative databases, and online registries hold a wealth of information with the potential to help answer long-standing questions across all facets of health care, from designing more effective treatment regimes (Komorowski et al., 2018) to tailoring treatment to individuals based on their characteristics (Abul-Husn and Kenny, 2019). A primary challenge in this domain is how best to harness that information, much of which is unstructured data such as text or complex data such as X-ray images (Koleck et al., 2019; Huang et al., 2020; Tayefi et al., 2021). Secondary analysis of even the lowest-hanging fruit from de-identified hospital records has yielded path-breaking insights into the effectiveness of medical interventions, both overturning received wisdom and improving patient care (Critical Data, 2016). These insights rest on a century of research in statistics, econometrics, and causal inference (Kleinberg and Hripcsak, 2011; Hernán et al., 2019) focused on gleaning valid cause-and-effect relationships from observational data using analysis strategies such as interrupted time series, regression discontinuity, and matching. At the same time, a literature in the computational social sciences has opened new doors to quantify, analyze, and rigorously interpret unstructured data.
Unsupervised techniques like Latent Dirichlet Allocation (LDA; Blei et al., 2003) and the Structural Topic Model (STM; Roberts et al., 2014) have enabled researchers with no a priori hypotheses to easily measure the composition of text corpora, while supervised methods like Convolutional Neural Nets (CNN; Albawi et al., 2017) have allowed for the scalable and automatic production of document-level outcomes or covariates (Grimmer et al., 2021), especially when traditional forms of quantitative data are lacking. Recent work (e.g., Roberts et al., 2020; Mozer et al., 2020; Egami et al., 2022) has formalized notation and begun to introduce best practices for incorporating unstructured data such as text into observational studies for improving causal inference, showing how text can fruitfully improve causal estimates in domains such as internet censorship and media
