Assumption-Lean Post-Integrated Inference with Negative Control Outcomes

Du, Jin-Hong, Roeder, Kathryn, Wasserman, Larry

arXiv.org Machine Learning 

In the big data era, integrating information from multiple heterogeneous sources has become increasingly crucial for achieving larger sample sizes and more diverse study populations. Data integration is applied in a variety of fields, including but not limited to causal inference on heterogeneous populations (Shi et al., 2023), survey sampling (Yang et al., 2020), health policy (Paddock et al., 2024), retrospective psychometrics (Howe and Brown, 2023), and multi-omics biological science (Du et al., 2022). Data integration methods have been proposed to mitigate the unwanted effects of heterogeneous datasets and unmeasured covariates and to recover the common variation across datasets. However, a critical and often overlooked question is whether reliable statistical inference can be made from integrated data. Directly performing statistical inference on integrated outcomes and covariates of interest fails to account for the complex correlation structures introduced by the data integration process, often leading to improper analyses that incorrectly assume the corrected data points are independent (Li et al., 2023). While data integration is broadly used across fields, our paper focuses on a challenging scenario involving high-dimensional outcomes.
