win condition
AutoLibra: Agent Metric Induction from Open-Ended Human Feedback
Zhu, Hao, Cuvin, Phil, Yu, Xinkai, Yan, Charlotte Ka Yee, Zhang, Jason, Yang, Diyi
Agents are predominantly evaluated and optimized via task success metrics, which are coarse, rely on manual design from experts, and fail to reward intermediate emergent behaviors. We propose **AutoLibra**, a framework for agent evaluation that transforms open-ended human feedback, *e.g.*, "If you find that the button is disabled, don't click it again" or "This agent has too much autonomy to decide what to do on its own", into metrics for evaluating fine-grained behaviors in agent trajectories. AutoLibra accomplishes this by grounding feedback to an agent's behavior, clustering similar positive and negative behaviors, and creating concrete metrics with clear definitions and concrete examples, which can be used for prompting LLM-as-a-Judge as evaluators. We further propose two meta-metrics to evaluate the alignment of a set of (induced) metrics with open feedback: "coverage" and "redundancy". Through optimizing these meta-metrics, we experimentally demonstrate AutoLibra's ability to induce more concrete agent evaluation metrics than the ones proposed in previous agent evaluation benchmarks and discover new metrics to analyze agents. We also present two applications of AutoLibra in agent improvement. First, we show that AutoLibra can help human prompt engineers diagnose agent failures and improve prompts iteratively. Moreover, we find that AutoLibra can induce metrics for automatic optimization of agents, which lets agents improve through self-regulation. Our results suggest that AutoLibra is a powerful task-agnostic tool for evaluating and improving language agents.
Evaluating Language Models' Evaluations of Games
Collins, Katherine M., Zhang, Cedegao E., Todd, Graham, Ying, Lance, da Costa, Mauricio Barba, Liu, Ryan, Sharma, Prafull, Weller, Adrian, Kuperwajs, Ionatan, Wong, Lionel, Tenenbaum, Joshua B., Griffiths, Thomas L.
Reasoning is not just about solving problems -- it is also about evaluating which problems are worth solving at all. Evaluations of artificial intelligence (AI) systems have primarily focused on problem solving, historically by studying how models play games such as chess and Go. In this paper, we advocate for a new paradigm that assesses AI systems' evaluation of games. First, we introduce a formalism for evaluating such evaluations. We then leverage a large-scale dataset of over 100 novel board games and over 450 human judgments to compare evaluations produced by modern language and reasoning models against those of people and symbolic computational agents. We consider two kinds of evaluative queries: assessing the payoff (or fairness) and the funness of games. These queries span two dimensions relevant to the design of evaluations of AI evaluations: how complex a query is to compute and how difficult a query is to quantify. Our results show that reasoning models are generally more aligned to people in their evaluations of games than non-reasoning language models. However, we observe a non-monotonic relationship: as models get closer to game-theoretic optimal, their fit to human data weakens. We also observe more "jaggedness" across models for assessing funness, in line with the greater difficulty of quantifying this query. Across queries and games, reasoning models show highly variable and unpredictable resource usage when assessing queries, pointing to the importance of imbuing more resource-rational meta-reasoning in language and reasoning models.
A Taxonomy of Collectible Card Games from a Game-Playing AI Perspective
Vieira, Ronaldo e Silva, Tavares, Anderson Rocha, Chaimowicz, Luiz
Collectible card games are challenging, widely played games that have received increasing attention from the AI research community in recent years. Despite important breakthroughs, the field still poses many unresolved challenges. This work aims to help further research on the genre by proposing a taxonomy of collectible card games by analyzing their rules, mechanics, and game modes from the perspective of game-playing AI research. To achieve this, we studied a set of popular games and provided a thorough discussion about their characteristics.
AMONGAGENTS: Evaluating Large Language Models in the Interactive Text-Based Social Deduction Game
Chi, Yizhou, Mao, Lingjun, Tang, Zineng
Strategic social deduction games serve as valuable testbeds for evaluating the understanding and inference skills of language models, offering crucial insights into social science, artificial intelligence, and strategic gaming. This paper focuses on creating proxies of human behavior in simulated environments, with Among Us utilized as a tool for studying simulated human behavior. The study introduces a text-based game environment, named AmongAgents, that mirrors the dynamics of Among Us. Players act as crew members aboard a spaceship, tasked with identifying impostors who are sabotaging the ship and eliminating the crew. Within this environment, the behavior of simulated language agents is analyzed. The experiments involve diverse game sequences featuring different configurations of Crewmates and Impostor personality archetypes. Our work demonstrates that state-of-the-art large language models (LLMs) can effectively grasp the game rules and make decisions based on the current context. This work aims to promote further exploration of LLMs in goal-oriented games with incomplete information and complex action spaces, as these settings offer valuable opportunities to assess language model performance in socially driven scenarios.
People use fast, goal-directed simulation to reason about novel games
Zhang, Cedegao E., Collins, Katherine M., Wong, Lionel, Weller, Adrian, Tenenbaum, Joshua B.
We can evaluate features of problems and their potential solutions well before we can effectively solve them. When considering a game we have never played, for instance, we might infer whether it is likely to be challenging, fair, or fun simply from hearing the game rules, prior to deciding whether to invest time in learning the game or trying to play it well. Many studies of game play have focused on optimality and expertise, characterizing how people and computational models play based on moderate to extensive search and after playing a game dozens (if not thousands or millions) of times. Here, we study how people reason about a range of simple but novel connect-n style board games. We ask people to judge how fair and how fun the games are from very little experience: just thinking about the game for a minute or so, before they have ever actually played with anyone else, and we propose a resource-limited model that captures their judgments using only a small number of partial game simulations and almost no lookahead search.
MIT Researcher Explores The Downside Of Machine Learning In Healthcare - Liwaiwai
While working toward her dissertation in Computer Science, Marzyeh Ghassemi PhD '17 wrote some papers on how machine learning techniques from AI could be applied to clinical data in order to predict patient outcomes. "It wasn't until the end of my PhD work that one of my committee members asked: 'Did you ever check to see how well your model worked across different groups of people?'" That question was eye-opening for Ghassemi, who had previously assessed the performance of models in aggregate, across all patients. Upon a closer look, she saw that models often worked differently, specifically worse, for minorities like black women--a revelation that took her by surprise. "I hadn't made the connection beforehand that health disparities would translate directly to model disparities," she says. "And given that I am a visible minority woman-identifying computer scientist at MIT, I am reasonably certain that many others weren't aware of this either."
Hidden biases in medical data could compromise AI approaches to healthcare
While working toward her dissertation in computer science at MIT, Marzyeh Ghassemi wrote several papers on how machine-learning techniques from artificial intelligence could be applied to clinical data in order to predict patient outcomes. "It wasn't until the end of my Ph.D. work that one of my committee members asked: 'Did you ever check to see how well your model worked across different groups of people?'" That question was eye-opening for Ghassemi, who had previously assessed the performance of models in aggregate, across all patients. Upon a closer look, she saw that models often worked differently--specifically worse--for populations including Black women, a revelation that took her by surprise. "I hadn't made the connection beforehand that health disparities would translate directly to model disparities," she says. "And given that I am a visible minority woman-identifying computer scientist at MIT, I am reasonably certain that many others weren't aware of this either." In a paper published Jan. 14 in the journal Patterns, Ghassemi--who earned her doctorate in 2017 and is now an assistant professor in the Department of Electrical Engineering and Computer Science and the MIT Institute for Medical Engineering and Science (IMES)--and her coauthor, Elaine Okanyene Nsoesie of Boston University, offer a cautionary note about the prospects for AI in medicine. "If used carefully, this technology could improve performance in health care and potentially reduce inequities," Ghassemi says. "But if we're not actually careful, technology could worsen care." It all comes down to data, given that the AI tools in question train themselves by processing and analyzing vast quantities of data. But the data they are given are produced by humans, who are fallible and whose judgments may be clouded by the fact that they interact differently with patients depending on their age, gender, and race, without even knowing it.
Furthermore, there is still great uncertainty about medical conditions themselves. "Doctors trained at the same medical school for 10 years can, and often do, disagree about a patient's diagnosis," Ghassemi says. That's different from the applications where existing machine-learning algorithms excel--like object-recognition tasks--because practically everyone in the world will agree that a dog is, in fact, a dog. Machine-learning algorithms have also fared well in mastering games like chess and Go, where both the rules and the "win conditions" are clearly defined. Physicians, however, don't always concur on the rules for treating patients, and even the win condition of being "healthy" is not widely agreed upon. "Doctors know what it means to be sick," Ghassemi explains, "and we have the most data for people when they are sickest.
The downside of machine learning in health care
While working toward her dissertation in computer science at MIT, Marzyeh Ghassemi wrote several papers on how machine-learning techniques from artificial intelligence could be applied to clinical data in order to predict patient outcomes. "It wasn't until the end of my PhD work that one of my committee members asked: 'Did you ever check to see how well your model worked across different groups of people?'" That question was eye-opening for Ghassemi, who had previously assessed the performance of models in aggregate, across all patients. Upon a closer look, she saw that models often worked differently -- specifically worse -- for populations including Black women, a revelation that took her by surprise. "I hadn't made the connection beforehand that health disparities would translate directly to model disparities," she says. "And given that I am a visible minority woman-identifying computer scientist at MIT, I am reasonably certain that many others weren't aware of this either."
Hidden Agenda: a Social Deduction Game with Diverse Learned Equilibria
Kopparapu, Kavya, Duéñez-Guzmán, Edgar A., Matyas, Jayd, Vezhnevets, Alexander Sasha, Agapiou, John P., McKee, Kevin R., Everett, Richard, Marecki, Janusz, Leibo, Joel Z., Graepel, Thore
A key challenge in the study of multiagent cooperation is the need for individual agents not only to cooperate effectively, but to decide with whom to cooperate. This is particularly critical in situations when other agents have hidden, possibly misaligned motivations and goals. Social deduction games offer an avenue to study how individuals might learn to synthesize potentially unreliable information about others, and elucidate their true motivations. In this work, we present Hidden Agenda, a two-team social deduction game that provides a 2D environment for studying learning agents in scenarios of unknown team alignment. The environment admits a rich set of strategies for both teams. Reinforcement learning agents trained in Hidden Agenda show that agents can learn a variety of behaviors, including partnering and voting without need for communication in natural language.
A Robot that Learns Connect Four Using Game Theory and Demonstrations
Teaching robots new skills using minimal time and effort has long been a goal of artificial intelligence. This paper investigates the use of game theoretic representations to represent and learn how to play interactive games such as Connect Four. We combine aspects of learning by demonstration, active learning, and game theory allowing a robot to learn by presenting its understanding of the structure of the game and conducting a question/answer session with a person. The paper demonstrates how a robot can be taught the win conditions of the game Connect Four and its variants using a single demonstration and a few trial examples with a question and answer session led by the robot. Our results show that the robot can learn any arbitrary win conditions for the Connect Four game without any prior knowledge of the win conditions and then play the game with a human utilizing the learned win conditions. Our experiments also show that some questions are more important for learning the game's win conditions.
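For readers unfamiliar with the term, the standard Connect Four win condition (four of one player's pieces in a row, column, or diagonal) can be written as a short check. The sketch below is only an illustration of what a "win condition" is, not the representation the robot in the abstract learns; the `Board` type, the `has_win` name, and the `n` parameter for variants are assumptions introduced here.

```python
from typing import List, Optional

# Hypothetical board layout: board[row][col], with None marking an empty cell.
Board = List[List[Optional[str]]]


def has_win(board: Board, player: str, n: int = 4) -> bool:
    """Return True if `player` has n pieces in a straight line.

    Lines are checked horizontally, vertically, and along both diagonals.
    """
    rows, cols = len(board), len(board[0])
    # (row step, column step) for each of the four line directions.
    directions = [(0, 1), (1, 0), (1, 1), (1, -1)]
    for r in range(rows):
        for c in range(cols):
            for dr, dc in directions:
                end_r, end_c = r + (n - 1) * dr, c + (n - 1) * dc
                # Skip lines that would run off the board.
                if not (0 <= end_r < rows and 0 <= end_c < cols):
                    continue
                if all(board[r + k * dr][c + k * dc] == player for k in range(n)):
                    return True
    return False


# Usage: a 6x7 board with four "X" pieces on a diagonal from the bottom-left.
board: Board = [[None] * 7 for _ in range(6)]
for k in range(4):
    board[5 - k][k] = "X"

x_wins = has_win(board, "X")
o_wins = has_win(board, "O")
```

Changing `n` or the `directions` list gives variant win conditions of the kind the abstract mentions, for example a five-in-a-row rule or one that ignores diagonals.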