Schellaert, Wout
PredictaBoard: Benchmarking LLM Score Predictability
Pacchiardi, Lorenzo, Voudouris, Konstantinos, Slater, Ben, Martínez-Plumed, Fernando, Hernández-Orallo, José, Zhou, Lexin, Schellaert, Wout
Despite possessing impressive skills, Large Language Models (LLMs) often fail unpredictably, demonstrating inconsistent success in even basic common sense reasoning tasks. This unpredictability poses a significant challenge to ensuring their safe deployment, as identifying and operating within a reliable "safe zone" is essential for mitigating risks. To address this, we present PredictaBoard, a novel collaborative benchmarking framework designed to evaluate the ability of score predictors (referred to as assessors) to anticipate LLM errors on specific task instances (i.e., prompts) from existing datasets. PredictaBoard evaluates pairs of LLMs and assessors by considering the rejection rate at different error tolerances. As such, PredictaBoard stimulates research into developing better assessors and into making LLMs more predictable, not merely more performant on average. We conduct illustrative experiments using baseline assessors and state-of-the-art LLMs. PredictaBoard highlights the critical need to evaluate predictability alongside performance, paving the way for safer AI systems where errors are not only minimised but also anticipated and effectively mitigated. Code for our benchmark can be found at https://github.com/Kinds-of-Intelligence-CFI/PredictaBoard
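As a minimal sketch of the rejection-rate idea described in this abstract: given an assessor's predicted success probabilities and the LLM's actual per-instance correctness, one can compute the fraction of instances that must be rejected so that the error rate on the accepted instances stays within a tolerance. The function name, thresholding strategy, and data below are illustrative assumptions, not the benchmark's actual API.

import numpy as np

def rejection_rate_at_tolerance(pred_success, correct, tolerance):
    """Fraction of instances an assessor must reject so that the error
    rate among the accepted (highest-confidence) instances <= tolerance.
    Hypothetical helper; PredictaBoard's own code may differ."""
    order = np.argsort(-np.asarray(pred_success))  # most confident first
    correct_sorted = np.asarray(correct, dtype=float)[order]
    n = len(correct_sorted)
    for k in range(n, 0, -1):  # try the largest accepted set first
        error_rate = 1.0 - correct_sorted[:k].mean()
        if error_rate <= tolerance:
            return 1.0 - k / n  # reject the remaining low-confidence items
    return 1.0  # no non-empty accepted set meets the tolerance

# Example: an assessor scoring five prompts for one LLM (toy data)
probs = [0.95, 0.90, 0.60, 0.40, 0.20]
truth = [1, 1, 1, 0, 0]
print(rejection_rate_at_tolerance(probs, truth, tolerance=0.1))  # -> 0.4

A lower rejection rate at a given tolerance indicates a more predictable LLM-assessor pair, which is the quantity the leaderboard ranks.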
Animal-AI 3: What's New & Why You Should Care
Voudouris, Konstantinos, Alhas, Ibrahim, Schellaert, Wout, Crosby, Matthew, Holmes, Joel, Burden, John, Chaubey, Niharika, Donnelly, Niall, Patel, Matishalin, Halina, Marta, Hernández-Orallo, José, Cheke, Lucy G.
The Animal-AI Environment is a unique game-based research platform designed to serve both the artificial intelligence and cognitive science research communities. In this paper, we present Animal-AI 3, the latest version of the environment, outlining several major new features that make the game more engaging for humans and more complex for AI systems. New features include interactive buttons, reward dispensers, and player notifications, as well as an overhaul of the environment's graphics and processing for significant improvements in agent training speed and in the quality of the human player experience. We provide detailed guidance on how to build computational and behavioural experiments with Animal-AI 3. We present results from a series of agents, including the state-of-the-art deep reinforcement learning agent DreamerV3, on newly designed tests and the Animal-AI Testbed of 900 tasks inspired by research in comparative psychology. Animal-AI 3 is designed to facilitate collaboration between the cognitive sciences and artificial intelligence. This paper serves as a stand-alone document that motivates, describes, and demonstrates Animal-AI 3 for the end user.
Predictable Artificial Intelligence
Zhou, Lexin, Moreno-Casares, Pablo A., Martínez-Plumed, Fernando, Burden, John, Burnell, Ryan, Cheke, Lucy, Ferri, Cèsar, Marcoci, Alexandru, Mehrbakhsh, Behzad, Moros-Daval, Yael, hÉigeartaigh, Seán Ó, Rutar, Danaja, Schellaert, Wout, Voudouris, Konstantinos, Hernández-Orallo, José
We introduce the fundamental ideas and challenges of Predictable AI, a nascent research area that explores the ways in which we can anticipate key indicators of present and future AI ecosystems. We argue that achieving predictability is crucial for fostering trust, liability, control, alignment and safety of AI ecosystems, and thus should be prioritised over performance. While distinct from other areas of technical and non-technical AI research, the questions, hypotheses and challenges relevant to Predictable AI had yet to be clearly described. This paper aims to elucidate them, calls for the identification of paths towards AI predictability, and outlines the potential impact of this emergent field.
Your Prompt is My Command: On Assessing the Human-Centred Generality of Multimodal Models
Schellaert, Wout, Martínez-Plumed, Fernando, Vold, Karina, Burden, John, A. M. Casares, Pablo, Sheng Loe, Bao, Reichart, Roi, Ó hÉigeartaigh, Sean, Korhonen, Anna, Hernández-Orallo, José
Even with obvious deficiencies, large prompt-commanded multimodal models are proving to be flexible cognitive tools of unprecedented generality. But the directness, diversity, and degree of user interaction create a distinctive "human-centred generality" (HCG), rather than a fully autonomous one. HCG implies that, for a specific user, a system is only as general as it is effective for the user's relevant tasks and their prevalent ways of prompting. A human-centred evaluation of general-purpose AI systems therefore needs to reflect the personal nature of interaction, tasks and cognition. We argue that the best way to understand these systems is as highly coupled cognitive extenders, and to analyse the bidirectional cognitive adaptations between them and humans. In this paper, we give a formulation of HCG, as well as a high-level overview of the elements and trade-offs involved in the prompting process. We end the paper by outlining some essential research questions and suggestions for improving evaluation practices, which we envision as characteristic of the evaluation of general artificial intelligence in the future. This paper appears in the AI & Society track.