Can Out-of-Distribution Evaluations Uncover Reliance on Shortcuts? A Case Study in Question Answering
Štefánik, Michal, Mickus, Timothee, Kadlčík, Marek, Spiegel, Michal, Kuchař, Josef
A majority of recent work in AI assesses models' generalization capabilities through their performance on out-of-distribution (OOD) datasets. Despite their practicality, such evaluations rest on a strong assumption: that OOD evaluations can capture and reflect possible failures in real-world deployment. In this work, we challenge this assumption and confront the results of OOD evaluations with a set of specific failure modes documented in existing question-answering (QA) models, referred to as reliance on spurious features or prediction shortcuts. We find that the datasets commonly used for OOD evaluation in QA yield estimates of models' robustness to shortcuts of vastly differing quality, with some largely under-performing even a simple in-distribution evaluation. We attribute this in part to spurious shortcuts being shared across ID and OOD datasets, but we also find cases where a dataset's quality for training and its quality for evaluation are largely disconnected. Our work underlines the limitations of commonly used OOD-based evaluations of generalization, and provides methodology and recommendations for evaluating generalization within and beyond QA more robustly.
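The abstract's central comparison can be made concrete with a small sketch. The following is an illustrative example, not the authors' code: given a pool of QA models, correlate each evaluation set's per-model accuracy with an independently measured shortcut-reliance score; an evaluation set whose scores barely correlate with shortcut reliance is a poor proxy for robustness. All model scores and dataset labels below are invented for illustration.

```python
# Sketch: how well does each evaluation set track shortcut robustness?
# All numbers are hypothetical; this is not the paper's methodology verbatim.
from statistics import correlation  # Pearson's r (Python >= 3.10)

# Hypothetical shortcut-reliance scores for five QA models
# (higher = leans more on spurious features, i.e. less robust).
shortcut_reliance = [0.72, 0.55, 0.40, 0.31, 0.18]

# Hypothetical accuracies of the same five models on three evaluation sets.
eval_accuracies = {
    "in-distribution": [0.81, 0.79, 0.78, 0.80, 0.77],
    "ood-set-A":       [0.52, 0.58, 0.63, 0.69, 0.74],
    "ood-set-B":       [0.60, 0.57, 0.61, 0.58, 0.62],
}

for name, accs in eval_accuracies.items():
    # A strongly negative r means the set rewards shortcut-robust models;
    # an r near zero means the set says little about shortcut reliance.
    r = correlation(accs, shortcut_reliance)
    print(f"{name:16s} r(accuracy, shortcut reliance) = {r:+.2f}")
```

Under these invented numbers, "ood-set-A" would be a useful robustness proxy (strongly negative r), while "ood-set-B" would not, mirroring the paper's finding that OOD datasets differ vastly in how well they reflect shortcut reliance.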
arXiv.org Artificial Intelligence
Aug-27-2025
- Country:
  - Asia
    - China > Hong Kong (0.04)
    - Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
    - Singapore (0.04)
  - Europe
    - Denmark > Capital Region > Copenhagen (0.04)
    - Finland > Uusimaa > Helsinki (0.04)
    - Italy > Tuscany > Florence (0.04)
    - Middle East > Malta > Eastern Region > Northern Harbour District > St. Julian's (0.04)
  - North America
    - Canada > Ontario > Toronto (0.04)
    - United States (0.14)
- Genre:
  - Research Report (1.00)