Granularity__final

Thao Nguyen

Neural Information Processing Systems

We use the iWildCam version 2.0 dataset released in 2021. Figure 14: Random examples of train set images, along with random examples from the out-of-distribution test set. Figure 15: Random examples from the ImageNet ILSVRC 2012 challenge train set [37, 11]. The full training set is notably not class balanced, exhibiting a long-tailed distribution (see Figure 16). Figure 17: Random examples from the iNaturalist 2017 challenge train set [46].



Effective Robustness against Natural Distribution Shifts for Models with Different Training Data

Neural Information Processing Systems

Existing effective robustness evaluations typically use a single test set such as ImageNet to evaluate the ID accuracy. This becomes problematic when evaluating models trained on different data distributions, e.g., comparing models trained on ImageNet vs. zero-shot language-image pre-trained models trained on LAION. In this paper, we propose a new evaluation metric to evaluate and compare the effective robustness of models trained on different data. To do this, we control for the accuracy on multiple ID test sets that cover the training distributions of all the evaluated models. Our new evaluation metric provides a better estimate of effective robustness when the compared models have different training data. It may also explain the surprising effective robustness gains of zero-shot CLIP-like models reported in prior works that used ImageNet as the only ID test set: these gains diminish under our new evaluation.
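The abstract's idea of controlling for multiple ID test sets can be sketched as fitting the OOD accuracy as a function of several ID accuracies (here, two) over a pool of baseline models, then measuring how far a candidate model sits above that fitted surface. This is only an illustrative sketch, not the paper's actual metric; all model accuracies below are synthetic, and the least-squares plane fit is one simple way to realize the idea.

```python
# Hedged sketch: effective robustness controlled for MULTIPLE ID test sets.
# Instead of predicting OOD accuracy from a single ID accuracy, fit a plane
# over two ID accuracies using a pool of baseline models, then measure a
# candidate model's residual above the plane. Synthetic numbers throughout.

def solve3(A, b):
    """Solve a 3x3 linear system A x = b by Gauss-Jordan elimination."""
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(3):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * p for a, p in zip(M[r], M[col])]
    return [M[i][3] / M[i][i] for i in range(3)]

def fit_plane(id1, id2, ood):
    """Least-squares fit ood ~ a*id1 + b*id2 + c via the normal equations."""
    n = len(ood)
    X = [[x1, x2, 1.0] for x1, x2 in zip(id1, id2)]
    XtX = [[sum(X[k][i] * X[k][j] for k in range(n)) for j in range(3)]
           for i in range(3)]
    Xty = [sum(X[k][i] * ood[k] for k in range(n)) for i in range(3)]
    return solve3(XtX, Xty)

def effective_robustness(coeffs, id1, id2, ood):
    """OOD accuracy above what the fitted plane predicts."""
    a, b, c = coeffs
    return ood - (a * id1 + b * id2 + c)

# Baseline pool: accuracies on two ID test sets and one OOD test set.
id1 = [0.70, 0.75, 0.80, 0.85, 0.72, 0.78]
id2 = [0.55, 0.70, 0.65, 0.80, 0.68, 0.60]
ood = [0.375, 0.475, 0.475, 0.575, 0.450, 0.440]

coeffs = fit_plane(id1, id2, ood)
# A candidate model whose OOD accuracy exceeds the plane's prediction:
print(round(effective_robustness(coeffs, 0.80, 0.70, 0.60), 3))  # -> 0.1
```

With a single ID test set this reduces to the usual line fit; adding the second ID axis is what lets the metric compare models whose training distributions differ.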


Models Out of Line: A Fourier Lens on Distribution Shift Robustness

Neural Information Processing Systems

Improving the accuracy of deep neural networks on out-of-distribution (OOD) data is critical to the acceptance of deep learning in real-world applications. It has been observed that accuracies on in-distribution (ID) versus OOD data follow a linear trend, and models that outperform this baseline are exceptionally rare (and referred to as "effectively robust"). Recently, some promising approaches have been developed to improve OOD robustness: model pruning, data augmentation, and ensembling or zero-shot evaluating large pretrained models. However, there is still no clear understanding of the conditions on OOD data and model properties that are required to observe effective robustness. We approach this issue by conducting a comprehensive empirical study of diverse approaches that are known to impact OOD robustness on a broad range of natural and synthetic distribution shifts of CIFAR-10 and ImageNet. In particular, we view the effective robustness puzzle through a Fourier lens and ask how spectral properties of both models and OOD data correlate with OOD robustness. We find this Fourier lens offers some insight into why certain robust models, particularly those from the CLIP family, achieve OOD robustness. However, our analysis also makes clear that no known metric is consistently the best explanation of OOD robustness. Thus, to aid future research into the OOD puzzle, we address the gap in publicly-available models with effective robustness by introducing a set of pretrained CIFAR-10 models---$RobustNets$---with varying levels of OOD robustness.


AT overview

Neural Information Processing Systems

Each row is a model, and each column is an evaluation setting. A few cells are empty due to resource constraints. As discussed in Section 4.1, multiple models trained on more data achieve positive effective robustness. However, this effect is not uniform: our experiments suggest that neither growing the number of images nor the number of classes in an i.i.d. manner consistently improves effective robustness. For one, our experiments consider only i.i.d. data.



We thank the reviewers for their feedback and reply to the major points raised by each reviewer individually

Neural Information Processing Systems

We thank the reviewers for their feedback and reply to the major points raised by each reviewer individually. Our paper focuses on ImageNet classification because this is what almost all prior work on robustness has studied. We hope that future work (e.g., transfer learning research) can build on our testbed. Our results are substantially more nuanced than "more data helps": (i) we show that only more data currently helps, which is a strong negative result. Appendix D contains additional results for more granular trends.