A new similarity measure for covariate shift with applications to nonparametric regression

Reese Pathak, Cong Ma, Martin J. Wainwright

arXiv.org Machine Learning 

In the standard formulation of prediction or classification, future data (as represented by a test set) are assumed to be drawn from the same distribution as the training data. This assumption, while theoretically convenient, may fail to hold in many real-world scenarios. For instance, training data might be collected only from a sub-group within a broader population (such as in medical trials), or the environment might change over time as data are collected. Such scenarios result in a distribution mismatch between the training and test data. In this paper, we study an important case of such distribution mismatch, namely the covariate shift problem (e.g., [21, 19]). Suppose that a statistician observes covariate-response pairs (X, Y), and wishes to build a prediction rule. In the problem of covariate shift, the distribution of the covariates X is allowed to change between the training and test data, while the posterior distribution of the responses (namely, Y | X) remains fixed. Compared to the usual i.i.d.