On the Need for a Language Describing Distribution Shifts: Illustrations on Tabular Datasets

Jiashuo Liu, Tianyu Wang, Peng Cui, Hongseok Namkoong

arXiv.org Artificial Intelligence 

The performance of predictive models has been observed to degrade under distribution shifts in a wide range of applications, such as healthcare [8, 68, 56, 67], economics [28, 18], education [5], vision [55, 47, 64, 70], and language [46, 6]. Distribution shifts vary in type and are typically defined as either a change in the marginal distribution of the covariates (X-shifts) or a change in the conditional relationship between the outcome and the covariates (Y|X-shifts). Real-world scenarios involve both types of shift. In computer vision [46, 37, 60, 30, 72], Y|X-shifts are less likely since Y is constructed from human labels given an input X. Due to the prevalence of X-shifts, the implicit goal of many researchers is to develop a single robust model that generalizes effectively across multiple domains, akin to humans. For tabular data, Y|X-shifts may arise because of missing variables and hidden confounders. For example, the prevalence of diseases among patients may be affected by covariates that are not recorded in medical datasets but vary among individuals, such as lifestyle factors (e.g., diet, exercise, smoking status) and socioeconomic status [31, 74, 67]. Under Y|X-shifts, there may be a fundamental trade-off between learning algorithms: to perform well on a target distribution, a model may necessarily have to perform worse on others. Algorithmically, typical methods for addressing Y|X-shifts include distributionally robust optimization (DRO) [11, 63, 21, 59, 20] and causal learning methods [54, 7, 62, 36].
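To make the taxonomy concrete (standard notation, not taken verbatim from this paper): any joint distribution factors as

    P(X, Y) = P(X) P(Y | X),

so an X-shift alters only the covariate marginal P(X) while leaving P(Y | X) intact, whereas a Y|X-shift alters the conditional P(Y | X) itself. The DRO methods cited above hedge against such shifts by minimizing the worst-case risk over an ambiguity set U(P) of distributions near the training distribution P,

    min_θ  sup_{Q ∈ U(P)}  E_Q[ ℓ(θ; X, Y) ],

where ℓ is the loss and θ the model parameters.

The following synthetic sketch illustrates why the distinction matters for tabular prediction. It is an illustrative assumption, not an experiment from the paper: the data-generating process, parameter values, and variable names are invented for the example.

    # Synthetic illustration (not from the paper): a linear outcome
    # mechanism Y = beta * X + noise, perturbed in two different ways.
    import numpy as np

    rng = np.random.default_rng(0)

    def sample(n, x_mean=0.0, beta=1.0):
        """Draw (X, Y) with Y = beta * X + noise.

        Changing x_mean moves P(X) only (an X-shift); changing beta
        moves P(Y | X) (a Y|X-shift), e.g. the effect of a hidden
        confounder altering the X -> Y relationship.
        """
        X = rng.normal(loc=x_mean, size=n)
        Y = beta * X + rng.normal(scale=0.5, size=n)
        return X, Y

    # Fit a least-squares model on the source domain.
    X_src, Y_src = sample(10_000)
    coef = np.polyfit(X_src, Y_src, deg=1)

    # X-shift: covariates move, outcome mechanism fixed.
    # Y|X-shift: covariates fixed, outcome mechanism moves.
    domains = {
        "source":    sample(10_000),
        "X-shift":   sample(10_000, x_mean=2.0),
        "Y|X-shift": sample(10_000, beta=-1.0),
    }
    for name, (X, Y) in domains.items():
        mse = np.mean((Y - np.polyval(coef, X)) ** 2)
        print(f"{name:>10}: test MSE = {mse:.2f}")

Under the pure X-shift the fitted model transfers (test error stays near the noise floor), while under the Y|X-shift it degrades sharply, mirroring the trade-off described above: no single fixed model can be optimal for both conditionals at once.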
