deprecation
The Leaderboard Illusion
Measuring progress is fundamental to the advancement of any scientific field. As benchmarks play an increasingly central role, they also become more susceptible to distortion. Chatbot Arena has emerged as the go-to leaderboard for ranking the most capable AI systems. Yet, in this work we identify systematic issues that have skewed the competitive landscape. Specifically, undisclosed private testing practices benefit a handful of providers who are able to test multiple variants before public release and selectively retract scores.
Deprecating Benchmarks: Criteria and Framework
Joaquin, Ayrton San, Gipiškis, Rokas, Staufer, Leon, Gil, Ariel
As frontier artificial intelligence (AI) models rapidly advance, benchmarks are integral to comparing different models and measuring their progress in different task-specific domains. However, there is a lack of guidance on when and how benchmarks should be deprecated once they cease to effectively perform their purpose. This risks benchmark scores over-valuing model capabilities, or worse, obscuring capabilities and safety-washing. Based on a review of benchmarking practices, we propose criteria to decide when to fully or partially deprecate benchmarks, and a framework for deprecating benchmarks. Our work aims to advance the state of benchmarking towards rigorous and quality evaluations, especially for frontier models, and our recommendations are aimed to benefit benchmark developers, benchmark users, AI governance actors (across governments, academia, and industry panels), and policy makers.
The Leaderboard Illusion
Singh, Shivalika, Nan, Yiyang, Wang, Alex, D'Souza, Daniel, Kapoor, Sayash, Üstün, Ahmet, Koyejo, Sanmi, Deng, Yuntian, Longpre, Shayne, Smith, Noah A., Ermis, Beyza, Fadaee, Marzieh, Hooker, Sara
Measuring progress is fundamental to the advancement of any scientific field. As benchmarks play an increasingly central role, they also grow more susceptible to distortion. Chatbot Arena has emerged as the go-to leaderboard for ranking the most capable AI systems. Yet, in this work we identify systematic issues that have resulted in a distorted playing field. We find that undisclosed private testing practices benefit a handful of providers who are able to test multiple variants before public release and retract scores if desired. We establish that the ability of these providers to choose the best score leads to biased Arena scores due to selective disclosure of performance results. At an extreme, we identify 27 private LLM variants tested by Meta in the lead-up to the Llama-4 release. We also establish that proprietary closed models are sampled at higher rates (number of battles) and have fewer models removed from the arena than open-weight and open-source alternatives. Both these policies lead to large data access asymmetries over time. Providers like Google and OpenAI have received an estimated 19.2% and 20.4% of all data on the arena, respectively. In contrast, a combined 83 open-weight models have only received an estimated 29.7% of the total data. We show that access to Chatbot Arena data yields substantial benefits; even limited additional data can result in relative performance gains of up to 112% on the arena distribution, based on our conservative estimates. Together, these dynamics result in overfitting to Arena-specific dynamics rather than general model quality. The Arena builds on the substantial efforts of both the organizers and an open community that maintains this valuable evaluation platform. We offer actionable recommendations to reform the Chatbot Arena's evaluation framework and promote fairer, more transparent benchmarking for the field
Neural Transition-based Parsing of Library Deprecations
Babkin, Petr, Navarro, Nacho, Alamir, Salwa, Shah, Sameena
This paper tackles the challenging problem of automating code updates to fix deprecated API usages of open source libraries by analyzing their release notes. Our system employs a three-tier architecture: first, a web crawler service retrieves deprecation documentation from the web; then a specially built parser processes those text documents into tree-structured representations; finally, a client IDE plugin locates and fixes identified deprecated usages of libraries in a given codebase. The focus of this paper in particular is the parsing component. We introduce a novel transition-based parser in two variants: based on a classical feature engineered classifier and a neural tree encoder. To confirm the effectiveness of our method, we gathered and labeled a set of 426 API deprecations from 7 well-known Python data science libraries, and demonstrated our approach decisively outperforms a non-trivial neural machine translation baseline.
Predictions Series 2022: How to Win in an Opt-In Era
Opt-in doomsayers believe this legislation could cripple the entire advertising industry because consumers will have more meaningful control over their privacy, and authenticated audiences will shrink as a result. However, the trends that have shaped the market can be bucked, and publishers and advertisers have an exciting opportunity to create a new ecosystem in compliance with the opt-in marketplace that benefits everyone involved – including consumers. Further, the browser and device manufacturer changes that are already in-progress are already moving the industry towards a more logged-in environment, in which it becomes easier for consumers to opt in as they authenticate. As we look towards an opt-in era in the future, it's important that publishers and marketers consider how the industry arrived at this point, and the lessons they can take away from this journey. Under the opt-out default, it's easy to see that the consumer experience has been lacking, and a lot of that falls on technology. The opt-out default enabled the propagation of third-party cookies, and the collection of data – often in a way that was not as transparent as it could have been for consumers.
The Problem of Zombie Datasets:A Framework For Deprecating Datasets
Corry, Frances, Sridharan, Hamsini, Luccioni, Alexandra Sasha, Ananny, Mike, Schultz, Jason, Crawford, Kate
What happens when a machine learning dataset is deprecated for legal, ethical, or technical reasons, but continues to be widely used? In this paper, we examine the public afterlives of several prominent deprecated or redacted datasets, including ImageNet, 80 Million Tiny Images, MS-Celeb-1M, Duke MTMC, Brainwash, and HRT Transgender, in order to inform a framework for more consistent, ethical, and accountable dataset deprecation. Building on prior research, we find that there is a lack of consistency, transparency, and centralized sourcing of information on the deprecation of datasets, and as such, these datasets and their derivatives continue to be cited in papers and circulate online. These datasets that never die -- which we term "zombie datasets" -- continue to inform the design of production-level systems, causing technical, legal, and ethical challenges; in so doing, they risk perpetuating the harms that prompted their supposed withdrawal, including concerns around bias, discrimination, and privacy. Based on this analysis, we propose a Dataset Deprecation Framework that includes considerations of risk, mitigation of impact, appeal mechanisms, timeline, post-deprecation protocol, and publication checks that can be adapted and implemented by the machine learning community. Drawing on work on datasheets and checklists, we further offer two sample dataset deprecation sheets and propose a centralized repository that tracks which datasets have been deprecated and could be incorporated into the publication protocols of venues like NeurIPS.