Pre-validation Revisited
Shang, Jing; Chatterjee, Sourav; Hastie, Trevor; Tibshirani, Robert
Over the past decade, modern biomedical technologies have transformed how we diagnose, treat, and prevent diseases. In particular, there is a surging need to combine gene expression data with traditional clinical measurements. However, because gene expression data are usually high-dimensional while the number of available clinical measurements is rather limited, naive approaches such as pooling all features together do not work well. Specifically, the microarray features would dominate inference and variable selection, so that valuable information in the clinical data is neglected or type I error control fails. To address this imbalance in feature dimensions, Tibshirani & Efron (2002) proposed the pre-validation procedure, which allows a fairer comparison between the two sets of predictors. In the setting of gene expression data and clinical measurements, the procedure has two steps: first, we use the high-dimensional gene expression data to compute leave-one-out fits of the response; then we use these fitted values together with the clinical data to build a final prediction model. It turns out that the pre-validation procedure not only enables us to test whether the gene expression data have predictive power while controlling type I error, but also gives a good estimate of the prediction error. Figure 1 illustrates the two-stage pre-validation procedure.
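The two steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the text does not name the first-stage learner, so ridge regression is swapped in here, the second stage is assumed to be ordinary least squares, and the data are simulated. The function names (`ridge_fit`, `prevalidated_predictor`, `second_stage`) and the penalty `lam` are hypothetical.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge coefficients without intercept; inputs are assumed centered."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def prevalidated_predictor(X_gene, y, lam=1.0):
    """Step 1: leave-one-out fits of the response from gene expression data.

    Ridge regression is an assumption made for this sketch; any
    high-dimensional learner could take its place.
    """
    n = X_gene.shape[0]
    y_pv = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i            # drop observation i
        X_tr, y_tr = X_gene[keep], y[keep]
        mu_x, mu_y = X_tr.mean(axis=0), y_tr.mean()
        beta = ridge_fit(X_tr - mu_x, y_tr - mu_y, lam)
        # Predict the held-out response; y[i] was never used in this fit.
        y_pv[i] = mu_y + (X_gene[i] - mu_x) @ beta
    return y_pv

def second_stage(y_pv, Z_clin, y):
    """Step 2: least squares of y on intercept, pre-validated fit, clinical data."""
    D = np.column_stack([np.ones(len(y)), y_pv, Z_clin])
    coef, *_ = np.linalg.lstsq(D, y, rcond=None)
    return coef

# Simulated data (hypothetical sizes): many gene features, few clinical ones.
rng = np.random.default_rng(0)
n, p, q = 40, 200, 3
X_gene = rng.normal(size=(n, p))
Z_clin = rng.normal(size=(n, q))
y = Z_clin @ np.array([1.0, -0.5, 0.25]) + rng.normal(size=n)

y_pv = prevalidated_predictor(X_gene, y, lam=10.0)
coef = second_stage(y_pv, Z_clin, y)   # coef[1] multiplies the pre-validated fit
```

In this sketch the coefficient `coef[1]` is the one that would be examined when testing whether the gene expression data carry predictive power beyond the clinical measurements. Because each entry of `y_pv` is computed without its own response, the gene-based predictor enters the second stage as a single column rather than as `p` separate features.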
May-23-2025