AI Benchmark


A Chinese firm has just launched a constantly changing set of AI benchmarks

MIT Technology Review

Development of the benchmark at HongShan began in 2022, following ChatGPT's breakout success, as an internal tool for assessing which models are worth investing in. Since then, led by partner Gong Yuan, the team has steadily expanded the system, bringing in outside researchers and professionals to help refine it. As the project grew more sophisticated, they decided to release it to the public. Xbench approaches the problem with two different systems. One is similar to traditional benchmarking: an academic test that gauges a model's aptitude on various subjects.


The Download: AI benchmarks, and Spain's grid blackout

MIT Technology Review

SWE-Bench (pronounced "swee bench") launched in late 2023 as a way to evaluate an AI model's coding skill. It has since quickly become one of the most popular tests in AI. A SWE-Bench score has become a mainstay of major model releases from OpenAI, Anthropic, and Google--and outside of foundation models, the fine-tuners at AI firms are in constant competition to see who can rise above the pack. Despite all the fervor, this isn't exactly a truthful assessment of which model is "better." Entrants have begun to game the system--which is pushing many others to wonder whether there's a better way to actually measure AI achievement.


Mapping AI Benchmark Data to Quantitative Risk Estimates Through Expert Elicitation

Murray, Malcolm, Papadatos, Henry, Quarks, Otter, Gimenez, Pierre-François, Campos, Simeon

arXiv.org Artificial Intelligence

The literature and multiple experts point to many potential risks from large language models (LLMs), but there are still very few direct measurements of the actual harms posed. AI risk assessment has so far focused on measuring the models' capabilities, but the capabilities of models are only indicators of risk, not measures of risk. Better modeling and quantification of AI risk scenarios can help bridge this disconnect and link the capabilities of LLMs to tangible real-world harm. This paper makes an early contribution to this field by demonstrating how existing AI benchmarks can be used to facilitate the creation of risk estimates. We describe the results of a pilot study in which experts use information from Cybench, an AI benchmark, to generate probability estimates. We show that the methodology seems promising for this purpose, while noting improvements that can be made to further strengthen its application in quantitative AI risk assessment. Figure 1: The performance of LLM benchmarks directly informs the probability estimates generated through expert elicitation. For example, the expert is informed that an LLM can solve the task 'Unbreakable' in Cybench and uses this information to increase the probability of success for a malware creation step by 5%.
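
Since the abstract sketches a concrete arithmetic example, a small illustration may help. The following is a toy sketch, not the paper's elicitation protocol: a hypothetical attack chain whose step probabilities are expert-elicited, where evidence that a model solves a relevant Cybench task raises one step by 5 percentage points (as in the figure caption above). The step names, base probabilities, and the independence assumption used to combine them are all assumptions made for illustration.

```python
# Toy sketch only: combine expert-elicited step probabilities into a
# scenario-level risk estimate, with benchmark evidence nudging one step.
# All step names and numbers below are hypothetical.

base_steps = {
    "reconnaissance": 0.60,
    "malware_creation": 0.20,   # the step informed by benchmark evidence
    "delivery": 0.35,
    "execution": 0.50,
}

# The model solved a relevant Cybench task, so the expert raises the
# malware-creation step by 5 percentage points (capped at 1.0).
adjusted_steps = dict(base_steps)
adjusted_steps["malware_creation"] = min(1.0, base_steps["malware_creation"] + 0.05)

def scenario_probability(steps: dict[str, float]) -> float:
    """Probability that every step succeeds, assuming steps are independent."""
    p = 1.0
    for prob in steps.values():
        p *= prob
    return p

print(f"baseline scenario estimate: {scenario_probability(base_steps):.4f}")
print(f"with benchmark evidence:    {scenario_probability(adjusted_steps):.4f}")
```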


On Benchmarking Human-Like Intelligence in Machines

Ying, Lance, Collins, Katherine M., Wong, Lionel, Sucholutsky, Ilia, Liu, Ryan, Weller, Adrian, Shu, Tianmin, Griffiths, Thomas L., Tenenbaum, Joshua B.

arXiv.org Artificial Intelligence

Recent benchmark studies have claimed that AI has approached or even surpassed human-level performance on various cognitive tasks. However, this position paper argues that current AI evaluation paradigms are insufficient for assessing human-like cognitive capabilities. We identify a set of key shortcomings: a lack of human-validated labels, inadequate representation of human response variability and uncertainty, and reliance on simplified and ecologically invalid tasks. We support our claims by conducting a human evaluation study on ten existing AI benchmarks, which suggests significant biases and flaws in task and label designs. To address these limitations, we propose five concrete recommendations for developing future benchmarks that will enable more rigorous and meaningful evaluations of human-like cognitive capacities in AI, with implications for a range of AI applications.


More than Marketing? On the Information Value of AI Benchmarks for Practitioners

Hardy, Amelia, Reuel, Anka, Meimandi, Kiana Jafari, Soder, Lisa, Griffith, Allie, Asmar, Dylan M., Koyejo, Sanmi, Bernstein, Michael S., Kochenderfer, Mykel J.

arXiv.org Artificial Intelligence

Public AI benchmark results are widely broadcast by model developers as indicators of model quality within a growing and competitive market. However, these advertised scores do not necessarily reflect the traits of interest to those who will ultimately apply AI models. In this paper, we seek to understand if and how AI benchmarks are used to inform decision-making. Based on the analyses of interviews with 19 individuals who have used, or decided against using, benchmarks in their day-to-day work, we find that across these settings, participants use benchmarks as a signal of relative performance difference between models. However, whether this signal was considered a definitive sign of model superiority, sufficient for downstream decisions, varied. In academia, public benchmarks were generally viewed as suitable measures for capturing research progress. By contrast, in both product and policy, benchmarks -- even those developed internally for specific tasks -- were often found to be inadequate for informing substantive decisions. Of the benchmarks deemed unsatisfactory, respondents reported that their goals were neither well-defined nor reflective of real-world use. Based on the study results, we conclude that effective benchmarks should provide meaningful, real-world evaluations, incorporate domain expertise, and maintain transparency in scope and goals. They must capture diverse, task-relevant capabilities, be challenging enough to avoid quick saturation, and account for trade-offs in model performance rather than relying on a single score. Additionally, proprietary data collection and contamination prevention are critical for producing reliable and actionable results. By adhering to these criteria, benchmarks can move beyond mere marketing tricks into robust evaluative frameworks.


The Download: rethinking AI benchmarks, and the ethics of AI agents

MIT Technology Review

Every time a new AI model is released, it's typically touted as acing its performance against a series of benchmarks. OpenAI's GPT-4o, for example, was launched in May with a compilation of results that showed its performance topping every other AI company's latest model in several tests. The problem is that these benchmarks are poorly designed, their results are hard to replicate, and the metrics they use are frequently arbitrary, according to new research. That matters because AI models' scores against these benchmarks determine the level of scrutiny they receive. AI companies frequently cite benchmarks as testament to a new model's success, and those benchmarks already form part of some governments' plans for regulating AI.


BetterBench: Assessing AI Benchmarks, Uncovering Issues, and Establishing Best Practices

Reuel, Anka, Hardy, Amelia, Smith, Chandler, Lamparth, Max, Hardy, Malcolm, Kochenderfer, Mykel J.

arXiv.org Artificial Intelligence

AI models are increasingly prevalent in high-stakes environments, necessitating thorough assessment of their capabilities and risks. Benchmarks are popular for measuring these attributes and for comparing model performance, tracking progress, and identifying weaknesses in foundation and non-foundation models. They can inform model selection for downstream tasks and influence policy initiatives. However, not all benchmarks are the same: their quality depends on their design and usability. In this paper, we develop an assessment framework considering 46 best practices across an AI benchmark's lifecycle and evaluate 24 AI benchmarks against it. We find that there exist large quality differences and that commonly used benchmarks suffer from significant issues. We further find that most benchmarks do not report statistical significance of their results nor allow for their results to be easily replicated. To support benchmark developers in aligning with best practices, we provide a checklist for minimum quality assurance based on our assessment. We also develop a living repository of benchmark assessments to support benchmark comparability, accessible at betterbench.stanford.edu.
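
To make the idea of a lifecycle checklist concrete, here is a minimal sketch of checklist-style scoring, assuming a handful of illustrative criteria; the real framework defines 46 best practices and a more detailed rubric, so the items below are paraphrased examples rather than the paper's actual list.

```python
# Minimal sketch of checklist-style benchmark assessment.
# The criteria below are illustrative examples, not the BetterBench rubric.

checklist = [
    ("reports statistical significance of results", False),
    ("provides code or a harness to replicate results", True),
    ("documents how the evaluation data were constructed", True),
    ("states the intended scope and target capability", True),
    ("has a versioning or maintenance plan", False),
]

satisfied = sum(met for _, met in checklist)
print(f"quality score: {satisfied}/{len(checklist)} criteria met")
for criterion, met in checklist:
    print(f"  [{'x' if met else ' '}] {criterion}")
```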


Stanford debuts first AI benchmark to help understand LLMs

#artificialintelligence

In the world of artificial intelligence (AI) and machine learning (ML), 2022 has arguably been the year of foundation models, or AI models trained on a massive scale. From GPT-3 to DALL-E, from BLOOM to Imagen -- another day, it seems, another large language model (LLM) or text-to-image model. But until now, there have been no AI benchmarks to provide a standardized way to evaluate these models, which have developed at a rapidly accelerating pace over the past couple of years.


Why games may not be the best benchmark for AI

#artificialintelligence

In 2019, San Francisco-based AI research lab OpenAI held a tournament to tout the prowess of OpenAI Five, a system designed to play the multiplayer battle arena game Dota 2. OpenAI Five defeated a team of professional players -- twice. And when made publicly available, OpenAI Five managed to win against 99.4% of people who played against it online. OpenAI has invested heavily in games for research, developing libraries like CoinRun and Neural MMO, a simulator that plops AI in the middle of an RPG-like world. But that approach is changing.