Understanding the performance of machine learning models across diverse data distributions is critically important for reliable applications. Motivated by this, there is a growing focus on curating benchmark datasets that capture distribution shifts. While valuable, the existing benchmarks are limited in that many of them only contain a small number of shifts and they lack systematic annotation about what is different across different shifts. We present MetaShift--a collection of 12,868 sets of natural images across 410 classes--to address this challenge. We leverage the natural heterogeneity of Visual Genome and its annotations to construct MetaShift. The key construction idea is to cluster images using its metadata, which provides context for each image (e.g. "cats with cars" or "cats in bathroom") that represent distinct data distributions. MetaShift has two important benefits: first, it contains orders of magnitude more natural data shifts than previously available. Second, it provides explicit explanations of what is unique about each of its data sets and a distance score that measures the amount of distribution shift between any two of its data sets. We demonstrate the utility of MetaShift in benchmarking several recent proposals for training models to be robust to data shifts. We find that the simple empirical risk minimization performs the best when shifts are moderate and no method had a systematic advantage for large shifts. We also show how MetaShift can help to visualize conflicts between data subsets during model training.
To deploy machine learning algorithms in real-world applications, we must pay attention to distribution shift, i.e. when the test distribution is different from the training distribution, which substantially degrades model performance. In this paper, we refer this problem as out-of-distribution (OOD) generalization and specifically consider performance gaps caused by two kinds of distribution shifts: domain shifts and subpopulation shifts. In domain shifts, the test data is sampled from different domains than the training data, which requires the trained model to generalize well to test domains without seeing the data from those domains in training time. Take health risk prediction as an example. We may want to train a model on patients from a few sampled hospitals and then deploy the model to a broader set of hospitals (Koh et al., 2021).
The null hypothesis is that the samples are drawn from the same distribution, which means that a low p-value is indicative of different distributions. In this example, we see that drift was detected for all of our features. In addition to statistical tests, there are other approaches you can take to tackle distribution shifts, such as visually inspecting histograms and distribution charts for individual features, which can be useful to confirm the disparity between distributions. In a more general topic, setting rule-based data validation is key in ensuring the quality of your data, which includes distribution changes, be it from external factors or systemic errors such as pipeline errors or missing data. For a more in-depth view on this topic, you can sign up for my upcoming workshop at ODSC Europe this June "Visually Inspecting Data Profiles for Data Distribution Shifts". In the workshop, we will also see how to visually inspect histograms and distribution charts and how to do data validation with whylogs' constraints. We will dig deeper into the concept of distribution shift and explore other popular packages in order to detect data shifts.
Recent work has shown that the performance of machine learning models can vary substantially when models are evaluated on data drawn from a distribution that is close to but different from the training distribution. As a result, predicting model performance on unseen distributions is an important challenge. Our work connects techniques from domain adaptation and predictive uncertainty literature, and allows us to predict model accuracy on challenging unseen distributions without access to labeled data. In the context of distribution shift, distributional distances are often used to adapt models and improve their performance on new domains, however accuracy estimation, or other forms of predictive uncertainty, are often neglected in these investigations. Through investigating a wide range of established distributional distances, such as Frechet distance or Maximum Mean Discrepancy, we determine that they fail to induce reliable estimates of performance under distribution shift. On the other hand, we find that the difference of confidences (DoC) of a classifier's predictions successfully estimates the classifier's performance change over a variety of shifts. We specifically investigate the distinction between synthetic and natural distribution shifts and observe that despite its simplicity DoC consistently outperforms other quantifications of distributional difference. $DoC$ reduces predictive error by almost half ($46\%$) on several realistic and challenging distribution shifts, e.g., on the ImageNet-Vid-Robust and ImageNet-Rendition datasets.
We formulate the problem of induced domain adaptation (IDA) when the underlying distribution/domain shift is introduced by the model being deployed. Our formulation is motivated by applications where the deployed machine learning models interact with human agents, and will ultimately face responsive and interactive data distributions. We formalize the discussions of the transferability of learning in our IDA setting by studying how the model trained on the available source distribution (data) would translate to the performance on the induced domain. We provide both upper bounds for the performance gap due to the induced domain shift, as well as lower bound for the trade-offs a classifier has to suffer on either the source training distribution or the induced target distribution. We provide further instantiated analysis for two popular domain adaptation settings with covariate shift and label shift. We highlight some key properties of IDA, as well as computational and learning challenges.