On the Need for a Language Describing Distribution Shifts: Illustrations on Tabular Datasets

Jiashuo Liu, Tianyu Wang, Peng Cui, Hongseok Namkoong

arXiv.org Artificial Intelligence 

The performance of predictive models has been observed to degrade under distribution shifts in a wide range of applications, such as healthcare [8, 68, 56, 67], economics [28, 18], education [5], vision [55, 47, 64, 70], and language [46, 6]. Distribution shifts vary in type and are typically defined as either a change in the marginal distribution of the covariates (X-shifts) or a change in the conditional relationship between the outcome and the covariates (Y|X-shifts). Real-world scenarios involve both types of shift. In computer vision [46, 37, 60, 30, 72], Y|X-shifts are less likely since Y is constructed from human labels given an input X. Due to the prevalence of X-shifts, the implicit goal of many researchers is to develop a single robust model that generalizes effectively across multiple domains, akin to humans. For tabular data, Y|X-shifts may arise because of missing variables and hidden confounders. For example, the prevalence of diseases among patients may be affected by covariates that are not recorded in medical datasets but vary among individuals, such as lifestyle factors (e.g., diet, exercise, smoking status) and socioeconomic status [31, 74, 67]. Under Y|X-shifts, there may be a fundamental trade-off between learning algorithms: to perform well on a target distribution, a model may necessarily have to perform worse on others. Algorithmically, typical methods for addressing Y|X-shifts include distributionally robust optimization (DRO) [11, 63, 21, 59, 20] and causal learning methods [54, 7, 62, 36].
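To make the taxonomy concrete (standard notation, not taken verbatim from this paper): any joint distribution factors as

    P(X, Y) = P(X) P(Y | X),

so an X-shift alters only the covariate marginal P(X) while leaving P(Y | X) intact, whereas a Y|X-shift alters the conditional P(Y | X) itself. The DRO methods cited above hedge against such shifts by minimizing the worst-case risk over an ambiguity set U(P) of distributions near the training distribution P,

    min_θ  sup_{Q ∈ U(P)}  E_Q[ ℓ(θ; X, Y) ],

where ℓ is the loss and θ the model parameters.

The following synthetic sketch illustrates why the distinction matters for tabular prediction. It is an illustrative assumption, not an experiment from the paper: the data-generating process, parameter values, and variable names are invented for the example.

    # Synthetic illustration (not from the paper): a linear outcome
    # mechanism Y = beta * X + noise, perturbed in two different ways.
    import numpy as np

    rng = np.random.default_rng(0)

    def sample(n, x_mean=0.0, beta=1.0):
        """Draw (X, Y) with Y = beta * X + noise.

        Changing x_mean moves P(X) only (an X-shift); changing beta
        moves P(Y | X) (a Y|X-shift), e.g. the effect of a hidden
        confounder altering the X -> Y relationship.
        """
        X = rng.normal(loc=x_mean, size=n)
        Y = beta * X + rng.normal(scale=0.5, size=n)
        return X, Y

    # Fit a least-squares model on the source domain.
    X_src, Y_src = sample(10_000)
    coef = np.polyfit(X_src, Y_src, deg=1)

    # X-shift: covariates move, outcome mechanism fixed.
    # Y|X-shift: covariates fixed, outcome mechanism moves.
    domains = {
        "source":    sample(10_000),
        "X-shift":   sample(10_000, x_mean=2.0),
        "Y|X-shift": sample(10_000, beta=-1.0),
    }
    for name, (X, Y) in domains.items():
        mse = np.mean((Y - np.polyval(coef, X)) ** 2)
        print(f"{name:>10}: test MSE = {mse:.2f}")

Under the pure X-shift the fitted model transfers (test error stays near the noise floor), while under the Y|X-shift it degrades sharply, mirroring the trade-off described above: no single fixed model can be optimal for both conditionals at once.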
