AI Benchmark


A Chinese firm has just launched a constantly changing set of AI benchmarks

MIT Technology Review

Development of the benchmark at HongShan began in 2022, following ChatGPT's breakout success, as an internal tool for assessing which models are worth investing in. Since then, led by partner Gong Yuan, the team has steadily expanded the system, bringing in outside researchers and professionals to help refine it. As the project grew more sophisticated, they decided to release it to the public. Xbench approaches the problem with two different systems. One is similar to traditional benchmarking: an academic test that gauges a model's aptitude on various subjects.


The Download: AI benchmarks, and Spain's grid blackout

MIT Technology Review

SWE-Bench (pronounced "swee bench") launched in late 2023 as a way to evaluate an AI model's coding skill. It has since quickly become one of the most popular tests in AI. A SWE-Bench score has become a mainstay of major model releases from OpenAI, Anthropic, and Google--and outside of foundation models, the fine-tuners at AI firms are in constant competition to see who can rise above the pack. Despite all the fervor, this isn't exactly a truthful assessment of which model is "better." Entrants have begun to game the system--which is pushing many others to wonder whether there's a better way to actually measure AI achievement.


Mapping AI Benchmark Data to Quantitative Risk Estimates Through Expert Elicitation

Murray, Malcolm, Papadatos, Henry, Quarks, Otter, Gimenez, Pierre-François, Campos, Simeon

arXiv.org Artificial Intelligence

The literature and multiple experts point to many potential risks from large language models (LLMs), but there are still very few direct measurements of the actual harms posed. AI risk assessment has so far focused on measuring the models' capabilities, but the capabilities of models are only indicators of risk, not measures of risk. Better modeling and quantification of AI risk scenarios can help bridge this disconnect and link the capabilities of LLMs to tangible real-world harm. This paper makes an early contribution to this field by demonstrating how existing AI benchmarks can be used to facilitate the creation of risk estimates. We describe the results of a pilot study in which experts use information from Cybench, an AI benchmark, to generate probability estimates. We show that the methodology seems promising for this purpose, while noting improvements that can be made to further strengthen its application in quantitative AI risk assessment. Figure 1: The performance of LLM benchmarks directly informs the probability estimates generated through expert elicitation. For example, the expert is informed that an LLM can solve the task 'Unbreakable' in Cybench and uses this information to increase the probability of success for a malware creation step by 5%.
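
Since the abstract sketches a concrete arithmetic example, a small illustration may help. The following is a toy sketch, not the paper's elicitation protocol: a hypothetical attack chain whose step probabilities are expert-elicited, where evidence that a model solves a relevant Cybench task raises one step by 5 percentage points (as in the figure caption above). The step names, base probabilities, and the independence assumption used to combine them are all assumptions made for illustration.

```python
# Toy sketch only: combine expert-elicited step probabilities into a
# scenario-level risk estimate, with benchmark evidence nudging one step.
# All step names and numbers below are hypothetical.

base_steps = {
    "reconnaissance": 0.60,
    "malware_creation": 0.20,   # the step informed by benchmark evidence
    "delivery": 0.35,
    "execution": 0.50,
}

# The model solved a relevant Cybench task, so the expert raises the
# malware-creation step by 5 percentage points (capped at 1.0).
adjusted_steps = dict(base_steps)
adjusted_steps["malware_creation"] = min(1.0, base_steps["malware_creation"] + 0.05)

def scenario_probability(steps: dict[str, float]) -> float:
    """Probability that every step succeeds, assuming steps are independent."""
    p = 1.0
    for prob in steps.values():
        p *= prob
    return p

print(f"baseline scenario estimate: {scenario_probability(base_steps):.4f}")
print(f"with benchmark evidence:    {scenario_probability(adjusted_steps):.4f}")
```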


On Benchmarking Human-Like Intelligence in Machines

Ying, Lance, Collins, Katherine M., Wong, Lionel, Sucholutsky, Ilia, Liu, Ryan, Weller, Adrian, Shu, Tianmin, Griffiths, Thomas L., Tenenbaum, Joshua B.

arXiv.org Artificial Intelligence

Recent benchmark studies have claimed that AI has approached or even surpassed human-level performance on various cognitive tasks. However, this position paper argues that current AI evaluation paradigms are insufficient for assessing human-like cognitive capabilities. We identify a set of key shortcomings: a lack of human-validated labels, inadequate representation of human response variability and uncertainty, and reliance on simplified and ecologically invalid tasks. We support our claims by conducting a human evaluation study on ten existing AI benchmarks, which suggests significant biases and flaws in task and label designs. To address these limitations, we propose five concrete recommendations for developing future benchmarks that will enable more rigorous and meaningful evaluations of human-like cognitive capacities in AI, with implications for a range of AI applications.


More than Marketing? On the Information Value of AI Benchmarks for Practitioners

Hardy, Amelia, Reuel, Anka, Meimandi, Kiana Jafari, Soder, Lisa, Griffith, Allie, Asmar, Dylan M., Koyejo, Sanmi, Bernstein, Michael S., Kochenderfer, Mykel J.

arXiv.org Artificial Intelligence

Public AI benchmark results are widely broadcast by model developers as indicators of model quality within a growing and competitive market. However, these advertised scores do not necessarily reflect the traits of interest to those who will ultimately apply AI models. In this paper, we seek to understand if and how AI benchmarks are used to inform decision-making. Based on the analyses of interviews with 19 individuals who have used, or decided against using, benchmarks in their day-to-day work, we find that across these settings, participants use benchmarks as a signal of relative performance difference between models. However, whether this signal was considered a definitive sign of model superiority, sufficient for downstream decisions, varied. In academia, public benchmarks were generally viewed as suitable measures for capturing research progress. By contrast, in both product and policy, benchmarks -- even those developed internally for specific tasks -- were often found to be inadequate for informing substantive decisions. Of the benchmarks deemed unsatisfactory, respondents reported that their goals were neither well-defined nor reflective of real-world use. Based on the study results, we conclude that effective benchmarks should provide meaningful, real-world evaluations, incorporate domain expertise, and maintain transparency in scope and goals. They must capture diverse, task-relevant capabilities, be challenging enough to avoid quick saturation, and account for trade-offs in model performance rather than relying on a single score. Additionally, proprietary data collection and contamination prevention are critical for producing reliable and actionable results. By adhering to these criteria, benchmarks can move beyond mere marketing tricks into robust evaluative frameworks.


The Download: rethinking AI benchmarks, and the ethics of AI agents

MIT Technology Review

Every time a new AI model is released, it's typically touted as acing its performance against a series of benchmarks. OpenAI's GPT-4o, for example, was launched in May with a compilation of results that showed its performance topping every other AI company's latest model in several tests. The problem is that these benchmarks are poorly designed, their results are hard to replicate, and the metrics they use are frequently arbitrary, according to new research. That matters because AI models' scores against these benchmarks determine the level of scrutiny they receive. AI companies frequently cite benchmarks as testament to a new model's success, and those benchmarks already form part of some governments' plans for regulating AI.


BetterBench: Assessing AI Benchmarks, Uncovering Issues, and Establishing Best Practices

Reuel, Anka, Hardy, Amelia, Smith, Chandler, Lamparth, Max, Hardy, Malcolm, Kochenderfer, Mykel J.

arXiv.org Artificial Intelligence

AI models are increasingly prevalent in high-stakes environments, necessitating thorough assessment of their capabilities and risks. Benchmarks are popular for measuring these attributes and for comparing model performance, tracking progress, and identifying weaknesses in foundation and non-foundation models. They can inform model selection for downstream tasks and influence policy initiatives. However, not all benchmarks are the same: their quality depends on their design and usability. In this paper, we develop an assessment framework considering 46 best practices across an AI benchmark's lifecycle and evaluate 24 AI benchmarks against it. We find that there exist large quality differences and that commonly used benchmarks suffer from significant issues. We further find that most benchmarks do not report statistical significance of their results nor allow for their results to be easily replicated. To support benchmark developers in aligning with best practices, we provide a checklist for minimum quality assurance based on our assessment. We also develop a living repository of benchmark assessments to support benchmark comparability, accessible at betterbench.stanford.edu.
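
To make the idea of a lifecycle checklist concrete, here is a minimal sketch of checklist-style scoring, assuming a handful of illustrative criteria; the real framework defines 46 best practices and a more detailed rubric, so the items below are paraphrased examples rather than the paper's actual list.

```python
# Minimal sketch of checklist-style benchmark assessment.
# The criteria below are illustrative examples, not the BetterBench rubric.

checklist = [
    ("reports statistical significance of results", False),
    ("provides code or a harness to replicate results", True),
    ("documents how the evaluation data were constructed", True),
    ("states the intended scope and target capability", True),
    ("has a versioning or maintenance plan", False),
]

satisfied = sum(met for _, met in checklist)
print(f"quality score: {satisfied}/{len(checklist)} criteria met")
for criterion, met in checklist:
    print(f"  [{'x' if met else ' '}] {criterion}")
```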


Stanford debuts first AI benchmark to help understand LLMs

#artificialintelligence

In the world of artificial intelligence (AI) and machine learning (ML), 2022 has arguably been the year of foundation models, or AI models trained on a massive scale. From GPT-3 to DALL-E, from BLOOM to Imagen -- another day, it seems, another large language model (LLM) or text-to-image model. But until now, there have been no AI benchmarks to provide a standardized way to evaluate these models, which have developed at a rapidly accelerating pace over the past couple of years.


Why games may not be the best benchmark for AI

#artificialintelligence

In 2019, San Francisco-based AI research lab OpenAI held a tournament to tout the prowess of OpenAI Five, a system designed to play the multiplayer battle arena game Dota 2. OpenAI Five defeated a team of professional players -- twice. And when made publicly available, OpenAI Five managed to win against 99.4% of people who played against it online. OpenAI has invested heavily in games for research, developing libraries like CoinRun and Neural MMO, a simulator that plops AI in the middle of an RPG-like world. But that approach is changing.