Benchmark datasets play a central role in the organization of machine learning research. They coordinate researchers around shared research problems and serve as a measure of progress towards shared goals. Despite the foundational role of benchmarking practices in this field, relatively little attention has been paid to the dynamics of benchmark dataset use and reuse, within or across machine learning subcommunities. In this paper, we dig into these dynamics. We study how dataset usage patterns differ across machine learning subcommunities and across time from 2015-2020. We find increasing concentration on fewer and fewer datasets within task communities, significant adoption of datasets from other tasks, and concentration across the field on datasets that have been introduced by researchers situated within a small number of elite institutions. Our results have implications for scientific evaluation, AI ethics, and equity/access within the field.
Bernard Koch, Emily Denton, Alex Hanna and Jacob Foster won a best paper award for "Reduced, Reused and Recycled: The Life of a Dataset in Machine Learning Research" in the datasets and benchmarks track at NeurIPS 2021. Here, Bernard tells us about the advantages and disadvantages of benchmarking, the findings of their paper, and plans for future work. Machine learning is a rather unusual science, partly because it straddles the space between science and engineering. The main way that progress is evaluated is through state-of-the-art benchmarking. The scientific community agrees on a shared problem and picks a dataset that it thinks is representative of the data you might see when trying to solve that problem in the real world; researchers then compare their algorithms by their scores on that dataset.
Datasets fuel AI models like gasoline (or electricity, as the case may be) fuels cars. Whether they're tasked with generating text, recognizing objects, or predicting a company's stock price, AI systems "learn" by sifting through countless examples to discern patterns in the data. For example, a computer vision system can be trained to recognize certain types of apparel, like coats and scarves, by looking at different images of that clothing. Beyond developing models, datasets are used to test trained AI systems to ensure they remain stable -- and to measure overall progress in the field. Models that top the leaderboards on certain open source benchmarks are considered state of the art (SOTA) for that particular task.
The 35th edition of the Neural Information Processing Systems conference (NeurIPS 2021) commenced on December 6, 2021. The nine-day conference is packed with a series of tutorials, workshops, and presentations. Over 9,000 papers were submitted to the conference this year, of which 2,344 were accepted; this is the highest number of papers accepted since 2013. The annual NeurIPS conference is one of the most awaited and best-attended machine learning events of the year. Leading companies and academic institutions like Google, Microsoft, Meta, DeepMind, Stanford, and Carnegie Mellon University participate in great numbers.
What happens when a machine learning dataset is deprecated for legal, ethical, or technical reasons, but continues to be widely used? In this paper, we examine the public afterlives of several prominent deprecated or redacted datasets, including ImageNet, 80 Million Tiny Images, MS-Celeb-1M, Duke MTMC, Brainwash, and HRT Transgender, in order to inform a framework for more consistent, ethical, and accountable dataset deprecation. Building on prior research, we find that there is a lack of consistency, transparency, and centralized sourcing of information on the deprecation of datasets, and as such, these datasets and their derivatives continue to be cited in papers and circulate online. These datasets that never die -- which we term "zombie datasets" -- continue to inform the design of production-level systems, causing technical, legal, and ethical challenges; in so doing, they risk perpetuating the harms that prompted their supposed withdrawal, including concerns around bias, discrimination, and privacy. Based on this analysis, we propose a Dataset Deprecation Framework that includes considerations of risk, mitigation of impact, appeal mechanisms, timeline, post-deprecation protocol, and publication checks that can be adapted and implemented by the machine learning community. Drawing on work on datasheets and checklists, we further offer two sample dataset deprecation sheets and propose a centralized repository that tracks which datasets have been deprecated and could be incorporated into the publication protocols of venues like NeurIPS.