Assumption-lean and Data-adaptive Post-Prediction Inference
Jiacheng Miao, Xinran Miao, Yixuan Wu, Jiwei Zhao, Qiongshi Lu
A fundamental challenge in modern scientific research is the acquisition of gold-standard data (Wang et al., 2023). These data, with their high accuracy and reliability, are essential to the validity of scientific discoveries, but obtaining them is often costly and labor-intensive. Fortunately, the advent and rapid development of machine learning (ML) have made it possible to predict outcomes from accessible covariates (He et al., 2016; LeCun et al., 2015). A prominent example is AlphaFold (Jumper et al., 2021), which uses readily available protein amino acid sequences to accurately predict protein structures that traditionally require extensive experimental effort to determine. This ML-based approach has demonstrated its potential to substantially reduce the time and resources required to measure gold-standard data (Cheng et al., 2023; Stokes et al., 2020). Despite these benefits, replacing gold-standard data with ML predictions introduces new challenges, particularly in maintaining the validity of downstream statistical analyses. Using such predictions indiscriminately, without acknowledging that they differ from observed gold-standard data, can lead to biased results and misleading scientific conclusions (Wang et al., 2020). This issue is exemplified by statistical analyses that use imputed gene expression in the Genotype-Tissue Expression (GTEx) project.
Nov-23-2023