Costello, Gavin
Navigates Like Me: Understanding How People Evaluate Human-Like AI in Video Games
Milani, Stephanie, Juliani, Arthur, Momennejad, Ida, Georgescu, Raluca, Rzepecki, Jaroslaw, Shaw, Alison, Costello, Gavin, Fang, Fei, Devlin, Sam, Hofmann, Katja
We aim to understand how people assess human-likeness in navigation produced by people and artificially intelligent (AI) agents in a video game. To this end, we propose a novel AI agent with the goal of generating more human-like behavior. We collect hundreds of crowd-sourced assessments comparing the human-likeness of navigation behavior generated by our agent and baseline AI agents with human-generated behavior. Our proposed agent passes a Turing Test, while the baseline agents do not. By passing a Turing Test, we mean that human judges could not quantitatively distinguish between videos of a person and an AI agent navigating. To understand what people believe constitutes human-like navigation, we extensively analyze the justifications of these assessments. This work provides insights into the characteristics that people consider human-like in the context of goal-directed video game navigation, which is a key step for further improving human interactions with AI agents.
Navigation Turing Test (NTT): Learning to Evaluate Human-Like Navigation
Devlin, Sam, Georgescu, Raluca, Momennejad, Ida, Rzepecki, Jaroslaw, Zuniga, Evelyn, Costello, Gavin, Leroy, Guy, Shaw, Ali, Hofmann, Katja
A key challenge on the path to developing agents that learn complex human-like behavior is the need to quickly and accurately quantify human-likeness. While human assessments of such behavior can be highly accurate, speed and scalability are limited. We address these limitations through a novel automated Navigation Turing Test (ANTT) that learns to predict human judgments of human-likeness. We demonstrate the effectiveness of our automated NTT on a navigation task in a complex 3D environment. We investigate six classification models to shed light on the types of architectures best suited to this task, and validate them against data collected through a human NTT. Our best models achieve high accuracy when distinguishing true human and agent behavior. At the same time, we show that predicting finer-grained human assessment of agents' progress towards human-like behavior remains unsolved. Our work takes an important step towards agents that more effectively learn complex human-like behavior.