Not enough data to create a plot.
Try a different view from the menu above.
Hernández-Orallo, José
General Scales Unlock AI Evaluation with Explanatory and Predictive Power
Zhou, Lexin, Pacchiardi, Lorenzo, Martínez-Plumed, Fernando, Collins, Katherine M., Moros-Daval, Yael, Zhang, Seraphina, Zhao, Qinlin, Huang, Yitian, Sun, Luning, Prunty, Jonathan E., Li, Zongqian, Sánchez-García, Pablo, Chen, Kexin Jiang, Casares, Pablo A. M., Zu, Jiyun, Burden, John, Mehrbakhsh, Behzad, Stillwell, David, Cebrian, Manuel, Wang, Jindong, Henderson, Peter, Wu, Sherry Tongshuang, Kyllonen, Patrick C., Cheke, Lucy, Xie, Xing, Hernández-Orallo, José
Ensuring safe and effective use of AI requires understanding and anticipating its performance on novel tasks, from advanced scientific challenges to transformed workplace activities. So far, benchmarking has guided progress in AI, but it has offered limited explanatory and predictive power for general-purpose AI systems, given the low transferability across diverse tasks. In this paper, we introduce general scales for AI evaluation that can explain what common AI benchmarks really measure, extract ability profiles of AI systems, and predict their performance for new task instances, in- and out-of-distribution. Our fully-automated methodology builds on 18 newly-crafted rubrics that place instance demands on general scales that do not saturate. Illustrated for 15 large language models and 63 tasks, high explanatory power is unleashed from inspecting the demand and ability profiles, bringing insights on the sensitivity and specificity exhibited by different benchmarks, and how knowledge, metacognition and reasoning are affected by model size, chain-of-thought and distillation. Surprisingly, high predictive power at the instance level becomes possible using these demand levels, providing superior estimates over black-box baseline predictors based on embeddings or finetuning, especially in out-of-distribution settings (new tasks and new benchmarks). The scales, rubrics, battery, techniques and results presented here represent a major step for AI evaluation, underpinning the reliable deployment of AI in the years ahead. (Collaborative platform: https://kinds-of-intelligence-cfi.github.io/ADELE.)
Paradigms of AI Evaluation: Mapping Goals, Methodologies and Culture
Burden, John, Tešić, Marko, Pacchiardi, Lorenzo, Hernández-Orallo, José
Research in AI evaluation has grown increasingly complex and multidisciplinary, attracting researchers with diverse backgrounds and objectives. As a result, divergent evaluation paradigms have emerged, often developing in isolation, adopting conflicting terminologies, and overlooking each other's contributions. This fragmentation has led to insular research trajectories and communication barriers both among different paradigms and with the general public, contributing to unmet expectations for deployed AI systems. To help bridge this insularity, in this paper we survey recent work in the AI evaluation landscape and identify six main paradigms. We characterise major recent contributions within each paradigm across key dimensions related to their goals, methodologies and research cultures. By clarifying the unique combination of questions and approaches associated with each paradigm, we aim to increase awareness of the breadth of current evaluation approaches and foster cross-pollination between different paradigms. We also identify potential gaps in the field to inspire future research directions.
PredictaBoard: Benchmarking LLM Score Predictability
Pacchiardi, Lorenzo, Voudouris, Konstantinos, Slater, Ben, Martínez-Plumed, Fernando, Hernández-Orallo, José, Zhou, Lexin, Schellaert, Wout
Despite possessing impressive skills, Large Language Models (LLMs) often fail unpredictably, demonstrating inconsistent success in even basic common sense reasoning tasks. This unpredictability poses a significant challenge to ensuring their safe deployment, as identifying and operating within a reliable "safe zone" is essential for mitigating risks. To address this, we present PredictaBoard, a novel collaborative benchmarking framework designed to evaluate the ability of score predictors (referred to as assessors) to anticipate LLM errors on specific task instances (i.e., prompts) from existing datasets. PredictaBoard evaluates pairs of LLMs and assessors by considering the rejection rate at different tolerance errors. As such, PredictaBoard stimulates research into developing better assessors and making LLMs more predictable, not only with a higher average performance. We conduct illustrative experiments using baseline assessors and state-of-the-art LLMs. PredictaBoard highlights the critical need to evaluate predictability alongside performance, paving the way for safer AI systems where errors are not only minimised but also anticipated and effectively mitigated. Code for our benchmark can be found at https://github.com/Kinds-of-Intelligence-CFI/PredictaBoard
What should an AI assessor optimise for?
Romero-Alvarado, Daniel, Martínez-Plumed, Fernando, Hernández-Orallo, José
An AI assessor is an external, ideally indepen-dent system that predicts an indicator, e.g., a loss value, of another AI system. Assessors can lever-age information from the test results of many other AI systems and have the flexibility of be-ing trained on any loss function or scoring rule: from squared error to toxicity metrics. Here we address the question: is it always optimal to train the assessor for the target metric? Or could it be better to train for a different metric and then map predictions back to the target metric? Us-ing twenty regression and classification problems with tabular data, we experimentally explore this question for, respectively, regression losses and classification scores with monotonic and non-monotonic mappings and find that, contrary to intuition, optimising for more informative met-rics is not generally better. Surprisingly, some monotonic transformations are promising. For example, the logistic loss is useful for minimis-ing absolute or quadratic errors in regression, and the logarithmic score helps maximise quadratic or spherical scores in classification.
Leaving the barn door open for Clever Hans: Simple features predict LLM benchmark answers
Pacchiardi, Lorenzo, Tesic, Marko, Cheke, Lucy G., Hernández-Orallo, José
The integrity of AI benchmarks is fundamental to accurately assess the capabilities of AI systems. The internal validity of these benchmarks - i.e., making sure they are free from confounding factors - is crucial for ensuring that they are measuring what they are designed to measure. In this paper, we explore a key issue related to internal validity: the possibility that AI systems can solve benchmarks in unintended ways, bypassing the capability being tested. This phenomenon, widely known in human and animal experiments, is often referred to as the 'Clever Hans' effect, where tasks are solved using spurious cues, often involving much simpler processes than those putatively assessed. Previous research suggests that language models can exhibit this behaviour as well. In several older Natural Language Processing (NLP) benchmarks, individual $n$-grams like "not" have been found to be highly predictive of the correct labels, and supervised NLP models have been shown to exploit these patterns. In this work, we investigate the extent to which simple $n$-grams extracted from benchmark instances can be combined to predict labels in modern multiple-choice benchmarks designed for LLMs, and whether LLMs might be using such $n$-gram patterns to solve these benchmarks. We show how simple classifiers trained on these $n$-grams can achieve high scores on several benchmarks, despite lacking the capabilities being tested. Additionally, we provide evidence that modern LLMs might be using these superficial patterns to solve benchmarks. This suggests that the internal validity of these benchmarks may be compromised and caution should be exercised when interpreting LLM performance results on them.
Learning Alternative Ways of Performing a Task
Nieves, David, Ramírez-Quintana, María José, Monserrat, Carlos, Ferri, César, Hernández-Orallo, José
A common way of learning to perform a task is to observe how it is carried out by experts. However, it is well known that for most tasks there is no unique way to perform them. This is especially noticeable the more complex the task is because factors such as the skill or the know-how of the expert may well affect the way she solves the task. In addition, learning from experts also suffers of having a small set of training examples generally coming from several experts (since experts are usually a limited and expensive resource), being all of them positive examples (i.e. examples that represent successful executions of the task). Traditional machine learning techniques are not useful in such scenarios, as they require extensive training data. Starting from very few executions of the task presented as activity sequences, we introduce a novel inductive approach for learning multiple models, with each one representing an alternative strategy of performing a task. By an iterative process based on generalisation and specialisation, we learn the underlying patterns that capture the different styles of performing a task exhibited by the examples. We illustrate our approach on two common activity recognition tasks: a surgical skills training task and a cooking domain. We evaluate the inferred models with respect to two metrics that measure how well the models represent the examples and capture the different forms of executing a task showed by the examples. We compare our results with the traditional process mining approach and show that a small set of meaningful examples is enough to obtain patterns that capture the different strategies that are followed to solve the tasks.
Animal-AI 3: What's New & Why You Should Care
Voudouris, Konstantinos, Alhas, Ibrahim, Schellaert, Wout, Crosby, Matthew, Holmes, Joel, Burden, John, Chaubey, Niharika, Donnelly, Niall, Patel, Matishalin, Halina, Marta, Hernández-Orallo, José, Cheke, Lucy G.
The Animal-AI Environment is a unique game-based research platform designed to serve both the artificial intelligence and cognitive science research communities. In this paper, we present Animal-AI 3, the latest version of the environment, outlining several major new features that make the game more engaging for humans and more complex for AI systems. New features include interactive buttons, reward dispensers, and player notifications, as well as an overhaul of the environment's graphics and processing for significant increases in agent training time and quality of the human player experience. We provide detailed guidance on how to build computational and behavioural experiments with Animal-AI 3. We present results from a series of agents, including the state-of-the-art Deep Reinforcement Learning agent (dreamer-v3), on newly designed tests and the Animal-AI Testbed of 900 tasks inspired by research in comparative psychology. Animal-AI 3 is designed to facilitate collaboration between the cognitive sciences and artificial intelligence. This paper serves as a stand-alone document that motivates, describes, and demonstrates Animal-AI 3 for the end user.
An International Consortium for Evaluations of Societal-Scale Risks from Advanced AI
Gruetzemacher, Ross, Chan, Alan, Frazier, Kevin, Manning, Christy, Los, Štěpán, Fox, James, Hernández-Orallo, José, Burden, John, Franklin, Matija, Ghuidhir, Clíodhna Ní, Bailey, Mark, Eth, Daniel, Pilditch, Toby, Kilian, Kyle
Given rapid progress toward advanced AI and risks from frontier AI systems (advanced AI systems pushing the boundaries of the AI capabilities frontier), the creation and implementation of AI governance and regulatory schemes deserves prioritization and substantial investment. However, the status quo is untenable and, frankly, dangerous. A regulatory gap has permitted AI labs to conduct research, development, and deployment activities with minimal oversight. In response, frontier AI system evaluations have been proposed as a way of assessing risks from the development and deployment of frontier AI systems. Yet, the budding AI risk evaluation ecosystem faces significant coordination challenges, such as a limited diversity of evaluators, suboptimal allocation of effort, and perverse incentives. This paper proposes a solution in the form of an international consortium for AI risk evaluations, comprising both AI developers and third-party AI risk evaluators. Such a consortium could play a critical role in international efforts to mitigate societal-scale risks from advanced AI, including in managing responsible scaling policies and coordinated evaluation-based risk response. In this paper, we discuss the current evaluation ecosystem and its shortcomings, propose an international consortium for advanced AI risk evaluations, discuss issues regarding its implementation, discuss lessons that can be learnt from previous international institutions and existing proposals for international AI governance institutions, and, finally, we recommend concrete steps to advance the establishment of the proposed consortium: (i) solicit feedback from stakeholders, (ii) conduct additional research, (iii) conduct a workshop(s) for stakeholders, (iv) analyze feedback and create final proposal, (v) solicit funding, and (vi) create a consortium.
Predictable Artificial Intelligence
Zhou, Lexin, Moreno-Casares, Pablo A., Martínez-Plumed, Fernando, Burden, John, Burnell, Ryan, Cheke, Lucy, Ferri, Cèsar, Marcoci, Alexandru, Mehrbakhsh, Behzad, Moros-Daval, Yael, hÉigeartaigh, Seán Ó, Rutar, Danaja, Schellaert, Wout, Voudouris, Konstantinos, Hernández-Orallo, José
We introduce the fundamental ideas and challenges of Predictable AI, a nascent research area that explores the ways in which we can anticipate key indicators of present and future AI ecosystems. We argue that achieving predictability is crucial for fostering trust, liability, control, alignment and safety of AI ecosystems, and thus should be prioritised over performance. While distinctive from other areas of technical and non-technical AI research, the questions, hypotheses and challenges relevant to Predictable AI were yet to be clearly described. This paper aims to elucidate them, calls for identifying paths towards AI predictability and outlines the potential impact of this emergent field.
Inferring Capabilities from Task Performance with Bayesian Triangulation
Burden, John, Voudouris, Konstantinos, Burnell, Ryan, Rutar, Danaja, Cheke, Lucy, Hernández-Orallo, José
As machine learning models become more general, we need to characterise them in richer, more meaningful ways. We describe a method to infer the cognitive profile of a system from diverse experimental data. To do so, we introduce measurement layouts that model how task-instance features interact with system capabilities to affect performance. These features must be triangulated in complex ways to be able to infer capabilities from non-populational data -- a challenge for traditional psychometric and inferential tools. Using the Bayesian probabilistic programming library PyMC, we infer different cognitive profiles for agents in two scenarios: 68 actual contestants in the AnimalAI Olympics and 30 synthetic agents for O-PIAAGETS, an object permanence battery. We showcase the potential for capability-oriented evaluation.