High-Dimensional Importance-Weighted Information Criteria: Theory and Optimality

Yong-Syun Cao, Shinpei Imori, Ching-Kang Ing

arXiv.org Machine Learning 

Various methods for high-dimensional model selection have been developed in recent years to address situations where the training and test data come from different distributions. When both input and output variables are available in the source (training) and target (test) domains but the target sample size is small, estimates based solely on the target data often suffer from high variance. To improve accuracy, auxiliary estimates from the source domain can be incorporated, along with bias correction to account for domain differences. This transfer learning strategy facilitates more reliable estimation under limited target information (see, for example, Li et al. (2021), Bastani (2021), and Tian and Feng (2022)). However, when test outputs (i.e., target responses) are unavailable, estimation or bias correction involving both domains becomes infeasible, as only inputs (covariates) are observed in the test set.
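To make the transfer-learning strategy described above concrete, the following is a minimal sketch of a two-step "source fit plus target bias correction" estimator in the spirit of Li et al. (2021) and Bastani (2021). It is an illustration only, not the procedure studied in this paper; all variable names, dimensions, sparsity levels, and regularization constants are assumptions chosen for the example.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Simulated source and target data whose coefficient vectors differ
# only by a sparse shift (all quantities here are illustrative).
p, n_source, n_target = 50, 500, 40
beta_source = np.zeros(p)
beta_source[:5] = 1.0
delta = np.zeros(p)
delta[0] = 0.3                      # small, sparse domain difference
beta_target = beta_source + delta

X_s = rng.standard_normal((n_source, p))
y_s = X_s @ beta_source + rng.standard_normal(n_source)
X_t = rng.standard_normal((n_target, p))
y_t = X_t @ beta_target + rng.standard_normal(n_target)

# Step 1: auxiliary estimate from the abundant source data.
beta_hat_s = Lasso(alpha=0.05).fit(X_s, y_s).coef_

# Step 2: bias correction -- fit the sparse source/target difference
# on the small target sample, using the source estimate as an offset.
residual_t = y_t - X_t @ beta_hat_s
delta_hat = Lasso(alpha=0.05).fit(X_t, residual_t).coef_
beta_hat_transfer = beta_hat_s + delta_hat

# Baseline: estimate from the small target sample alone.
beta_hat_target_only = Lasso(alpha=0.05).fit(X_t, y_t).coef_

print("transfer error:   ", np.linalg.norm(beta_hat_transfer - beta_target))
print("target-only error:", np.linalg.norm(beta_hat_target_only - beta_target))
```

The second lasso relies on the source-target difference being sparse, so even a small target sample can estimate it; note that this bias-correction step requires the target responses y_t, which is exactly what fails in the covariate-shift setting described at the end of the paragraph above, where only target inputs are observed.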