
Collaborating Authors: Georgescu, Raluca


Scaling Laws for Pre-training Agents and World Models

arXiv.org Artificial Intelligence

The performance of embodied agents has been shown to improve by increasing model parameters, dataset size, and compute. This has been demonstrated in domains from robotics to video games, when generative learning objectives on offline datasets (pre-training) are used to model an agent's behavior (imitation learning) or its environment (world modeling). This paper characterizes the role of scale in these tasks more precisely. Going beyond the simple intuition that 'bigger is better', we show that the same types of power laws found in language modeling also arise in world modeling and imitation learning (e.g. between loss and optimal model size). However, the coefficients of these laws are heavily influenced by the tokenizer, task and architecture; this has important implications for the optimal sizing of models and data.
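
For reference, scaling laws of this kind are typically written in the following Chinchilla-style form. This generic parameterization is an assumption for illustration only; the paper fits its own coefficients per task, tokenizer and architecture:

```latex
% Generic loss scaling law in model size N and dataset size D.
% E is the irreducible loss; A, B, \alpha, \beta are fitted coefficients.
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
% Minimizing L under a fixed compute budget C \approx 6ND yields
% compute-optimal sizes N^{*}(C) \propto C^{a} and D^{*}(C) \propto C^{b},
% so different fitted coefficients imply different optimal model/data splits.
```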


Visual Encoders for Data-Efficient Imitation Learning in Modern Video Games

arXiv.org Artificial Intelligence

Video games have served as useful benchmarks for the decision making community, but going beyond Atari games towards training agents in modern games has been prohibitively expensive for the vast majority of the research community. Recent progress in the research, development and open release of large vision models has the potential to amortize some of these costs across the community. However, it is currently unclear which of these models have learnt representations that retain information critical for sequential decision making. Towards enabling wider participation in the research of game-playing agents in modern games, we present a systematic study of imitation learning with publicly available visual encoders compared to the typical, task-specific, end-to-end training approach in Minecraft, Minecraft Dungeons and Counter-Strike: Global Offensive. Video games do not only serve as benchmarks but also represent a vast entertainment industry, where AI agents may eventually have applications in games development, including game testing or game design (Jacob et al., 2020; Gillberg et al., 2023). In the past, video game research often necessitated close integration with the games themselves to obtain game-specific information and establish a scalable interface for training agents. To eliminate integration costs during training, we use behavior cloning to train agents entirely offline, utilising previously collected human gameplay data. Although prior research has explored encoding images into lower-dimensional representations for behavior cloning, these studies primarily targeted robotics applications (Nair et al., 2022), where images often resemble real-world scenes. Inspired by the challenges and potential applications in video games, we investigate the following research question: how can images be encoded for data-efficient imitation learning in modern video games? Towards this guiding research question, we compare both end-to-end trained visual encoders and pre-trained visual encoders in three modern video games: Minecraft, Minecraft Dungeons and Counter-Strike: Global Offensive (CS:GO).
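
To make the pre-trained-encoder setting concrete, here is a minimal behavior-cloning sketch with a frozen visual encoder. The specific encoder (a torchvision ResNet-18) and the action count are illustrative stand-ins, not the paper's exact configuration; the study compares several publicly available encoders against end-to-end training:

```python
# Behavior cloning with a frozen, pre-trained visual encoder (sketch).
# Assumes recent torchvision (>= 0.13) for the weights API.
import torch
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights

encoder = resnet18(weights=ResNet18_Weights.DEFAULT)
encoder.fc = nn.Identity()          # expose 512-d features instead of class logits
encoder.requires_grad_(False)       # frozen: only the policy head is trained
encoder.eval()

n_actions = 16                      # hypothetical discrete action count
policy_head = nn.Sequential(
    nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, n_actions)
)
optimizer = torch.optim.Adam(policy_head.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()

def bc_step(frames, actions):
    """One behavior-cloning update on a batch of (frame, human action) pairs."""
    with torch.no_grad():
        feats = encoder(frames)     # (B, 512) representations of game frames
    logits = policy_head(feats)
    loss = loss_fn(logits, actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage with dummy data shaped like preprocessed gameplay frames:
frames = torch.randn(8, 3, 224, 224)
actions = torch.randint(0, n_actions, (8,))
print(bc_step(frames, actions))
```

The end-to-end baseline differs only in leaving the encoder unfrozen (and typically smaller), which is what makes the comparison a question of data efficiency.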


Imitating Human Behaviour with Diffusion Models

arXiv.org Artificial Intelligence

Diffusion models have emerged as powerful generative models in the text-to-image domain. This paper studies their application as observation-to-action models for imitating human behaviour in sequential environments. Human behaviour is stochastic and multimodal, with structured correlations between action dimensions. Meanwhile, standard modelling choices in behaviour cloning are limited in their expressiveness and may introduce bias into the cloned policy. We begin by pointing out the limitations of these choices. We then propose that diffusion models are an excellent fit for imitating human behaviour, since they learn an expressive distribution over the joint action space. We introduce several innovations to make diffusion models suitable for sequential environments: designing suitable architectures, investigating the role of guidance, and developing reliable sampling strategies. Experimentally, diffusion models closely match human demonstrations in a simulated robotic control task and a modern 3D gaming environment. To enable human-AI collaboration, agents must learn to best respond to all plausible human behaviours (Dafoe et al., 2020; Mirsky et al., 2022). In simple environments, it suffices to generate all possible human behaviours (Strouse et al., 2021), but as the complexity of the environment grows this approach will struggle to scale. If we instead assume access to human behavioural data, collaborative agents can be improved by training with models of human behaviour (Carroll et al., 2019). In principle, human behaviour can be modelled via imitation learning approaches in which an agent is trained to mimic the actions of a demonstrator from an offline dataset of observation and action tuples. More specifically, Behaviour Cloning (BC), despite being theoretically limited (Ross et al., 2011), has been empirically effective in domains such as autonomous driving (Pomerleau, 1991), robotics (Florence et al., 2022) and game playing (Ye et al., 2020; Pearce and Zhu, 2022). Popular approaches to BC restrict the types of distributions that can be modelled to make learning simpler. A common approach for continuous actions is to learn a point estimate, optimised via Mean Squared Error (MSE), which can be interpreted as an isotropic Gaussian of negligible variance. Another popular approach is to discretise the action space into a finite number of bins and frame prediction as a classification problem. Both suffer from the approximations they make (illustrated in Figure 1): either encouraging the agent to learn an 'average' policy, or predicting action dimensions independently, resulting in 'uncoordinated' behaviour (Ke et al., 2020).
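
As a rough illustration of the "expressive distribution over the joint action space" point, here is a minimal DDPM-style sampler for an observation-conditioned action. The denoising network, step count and noise schedule are assumptions for exposition, not the paper's architecture:

```python
# Sampling one action from a diffusion policy: reverse diffusion over the
# joint action vector, conditioned on the observation (illustrative sketch).
import torch
import torch.nn as nn

T = 50                                   # diffusion steps (assumed)
betas = torch.linspace(1e-4, 0.02, T)    # standard linear noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

obs_dim, act_dim = 32, 4                 # hypothetical dimensions
eps_model = nn.Sequential(               # predicts noise from (obs, noisy action, t)
    nn.Linear(obs_dim + act_dim + 1, 128), nn.ReLU(), nn.Linear(128, act_dim)
)

@torch.no_grad()
def sample_action(obs):
    """Draw one action by denoising from pure Gaussian noise."""
    a = torch.randn(act_dim)
    for t in reversed(range(T)):
        t_in = torch.tensor([t / T])
        eps = eps_model(torch.cat([obs, a, t_in]))
        # DDPM posterior mean; all action dimensions are denoised jointly,
        # which is what keeps the sampled action 'coordinated'.
        a = (a - betas[t] / torch.sqrt(1 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:
            a = a + torch.sqrt(betas[t]) * torch.randn(act_dim)
    return a

print(sample_action(torch.randn(obs_dim)))
```

Because sampling starts from fresh noise each time, repeated calls on the same observation can land in different modes of the demonstrated behaviour, unlike an MSE point estimate.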


Navigates Like Me: Understanding How People Evaluate Human-Like AI in Video Games

arXiv.org Artificial Intelligence

We aim to understand how people assess human-likeness in navigation produced by people and by artificially intelligent (AI) agents in a video game. To this end, we propose a novel AI agent with the goal of generating more human-like behavior. We collect hundreds of crowd-sourced assessments comparing the human-likeness of navigation behavior generated by our agent and baseline AI agents with human-generated behavior. Our proposed agent passes a Turing Test, while the baseline agents do not. By passing a Turing Test, we mean that human judges could not quantitatively distinguish between videos of a person and an AI agent navigating. To understand what people believe constitutes human-like navigation, we extensively analyze the justifications of these assessments. This work provides insights into the characteristics that people consider human-like in the context of goal-directed video game navigation, which is a key step towards further improving human interactions with AI agents.


Towards Flexible Inference in Sequential Decision Problems via Bidirectional Transformers

arXiv.org Artificial Intelligence

Note: This paper is superseded by the full version (Carroll et al., 2022). Randomly masking and predicting word tokens has been a successful approach in pre-training language models for a variety of downstream tasks. In this work, we observe that the same idea also applies naturally to sequential decision making, where many well-studied tasks like behavior cloning, offline RL, inverse dynamics, and waypoint conditioning correspond to different sequence maskings over a sequence of states, actions, and returns. We introduce the FlexiBiT framework, which provides a unified way to specify models which can be trained on many different sequential decision making tasks. We show that a single FlexiBiT model is simultaneously capable of carrying out many tasks with performance similar to or better than specialized models. Additionally, we show that performance can be further improved by fine-tuning our general model on specific tasks of interest. Masked language modeling (Devlin et al., 2018) is a key technique in natural language processing (NLP). Under this paradigm, models are trained to predict randomly-masked subsets of tokens in a sequence.


UniMASK: Unified Inference in Sequential Decision Problems

arXiv.org Artificial Intelligence

Randomly masking and predicting word tokens has been a successful approach in pre-training language models for a variety of downstream tasks. In this work, we observe that the same idea also applies naturally to sequential decision making, where many well-studied tasks like behavior cloning, offline reinforcement learning, inverse dynamics, and waypoint conditioning correspond to different sequence maskings over a sequence of states, actions, and returns. We introduce the Uni[MASK] framework, which provides a unified way to specify models which can be trained on many different sequential decision making tasks. We show that a single Uni[MASK] model is often capable of carrying out many tasks with performance similar to or better than single-task models. Additionally, after fine-tuning, our Uni[MASK] models consistently outperform comparable single-task models. Our code is publicly available here.
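
To make the "tasks as maskings" idea concrete, here is a small sketch of how a few decision-making tasks reduce to visibility patterns over an interleaved (state, action) token sequence. The token layout and task framings are assumptions for exposition in the spirit of Uni[MASK], not its exact implementation:

```python
# Different sequential decision tasks expressed as masks over one sequence
# laid out as s_0, a_0, s_1, a_1, ..., s_{T-1}, a_{T-1}.
# mask = 1 means "visible to the model"; 0 means "predict this token".
import numpy as np

def layout(T, visible_states, visible_actions):
    mask = np.zeros(2 * T, dtype=int)
    for t in visible_states:
        mask[2 * t] = 1                  # state tokens sit at even indices
    for t in visible_actions:
        mask[2 * t + 1] = 1              # action tokens sit at odd indices
    return mask

# Behavior cloning: all states and past actions visible; predict a_{T-1}.
bc_mask = layout(4, visible_states=range(4), visible_actions=range(3))
# Inverse dynamics: see s_0 and s_1, predict the action a_0 between them.
inverse_mask = layout(2, visible_states=[0, 1], visible_actions=[])
# Waypoint conditioning: see start and goal states, infill everything between.
waypoint_mask = layout(4, visible_states=[0, 3], visible_actions=[])

print(bc_mask, inverse_mask, waypoint_mask, sep="\n")
```

Training one bidirectional model on randomly drawn masks of this kind is what lets a single network serve all of these tasks at inference time.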


Go-Explore Complex 3D Game Environments for Automated Reachability Testing

arXiv.org Artificial Intelligence

Modern AAA video games feature huge game levels and maps which are increasingly hard for level testers to cover exhaustively. As a result, games often ship with catastrophic bugs such as the player falling through the floor or being stuck in walls. We propose an approach specifically targeted at reachability bugs in simulated 3D environments, based on the powerful exploration algorithm Go-Explore, which saves unique checkpoints across the map and then identifies promising ones to explore from. We show that when coupled with simple heuristics derived from the game's navigation mesh, Go-Explore finds challenging bugs and comprehensively explores complex environments without the need for human demonstration or knowledge of the game dynamics. Go-Explore vastly outperforms more complicated baselines, including reinforcement learning with intrinsic curiosity, in both coverage of the navigation mesh and the number of unique positions discovered across the map. Finally, due to our use of parallel agents, our algorithm can fully cover a vast 1.5 km x 1.5 km game world within 10 hours on a single machine, making it extremely promising for continuous testing suites.
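
The core Go-Explore loop is short enough to sketch. The cell discretization, the selection weighting and the environment API (`reset`, `position`, `save_checkpoint`, `load_checkpoint`, `sample_action`, `step`) are all illustrative assumptions, not this paper's interface:

```python
# Go-Explore for reachability testing (sketch): archive unique checkpoints,
# repeatedly restore a promising one, and explore from there.
import random

def cell_of(position, grid=5.0):
    """Discretize a 3D position into a coarse cell key."""
    return tuple(int(c // grid) for c in position)

def go_explore(env, iterations=10_000, horizon=50):
    archive = {}                           # cell -> (checkpoint, visit count)
    env.reset()
    archive[cell_of(env.position())] = (env.save_checkpoint(), 0)

    for _ in range(iterations):
        # Select: favor rarely-visited cells with a 1 / (1 + visits) weighting.
        cells = list(archive)
        weights = [1.0 / (1 + archive[c][1]) for c in cells]
        cell = random.choices(cells, weights=weights)[0]
        ckpt, visits = archive[cell]
        archive[cell] = (ckpt, visits + 1)

        # Go: restore the saved checkpoint. Explore: take random actions.
        env.load_checkpoint(ckpt)
        for _ in range(horizon):
            env.step(env.sample_action())
            c = cell_of(env.position())
            if c not in archive:           # new reachable area discovered
                archive[c] = (env.save_checkpoint(), 0)
    return archive                         # coverage map of reachable cells
```

Comparing the final archive against the navigation mesh is then what surfaces reachability bugs: cells reached that the mesh says should be unreachable, or mesh regions no agent ever reaches.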


Navigation Turing Test (NTT): Learning to Evaluate Human-Like Navigation

arXiv.org Artificial Intelligence

A key challenge on the path to developing agents that learn complex human-like behavior is the need to quickly and accurately quantify human-likeness. While human assessments of such behavior can be highly accurate, speed and scalability are limited. We address these limitations through a novel automated Navigation Turing Test (ANTT) that learns to predict human judgments of human-likeness. We demonstrate the effectiveness of our ANTT on a navigation task in a complex 3D environment. We investigate six classification models to shed light on the types of architectures best suited to this task, and validate them against data collected through a human NTT. Our best models achieve high accuracy when distinguishing true human and agent behavior. At the same time, we show that predicting finer-grained human assessments of agents' progress towards human-like behavior remains unsolved. Our work takes an important step towards agents that more effectively learn complex human-like behavior.
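
As one hypothetical instance of the kind of classifier such an automated NTT trains, here is a minimal recurrent model that maps a trajectory of low-dimensional observations (e.g., per-frame positions) to a human-vs-agent judgment. The architecture and feature choice are assumptions; the paper compares six such models:

```python
# Trajectory classifier for an automated Navigation Turing Test (sketch).
import torch
import torch.nn as nn

class TrajectoryClassifier(nn.Module):
    def __init__(self, feat_dim=3, hidden=64):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)   # logit: human (1) vs agent (0)

    def forward(self, traj):               # traj: (B, T, feat_dim)
        _, h = self.rnn(traj)              # final hidden state summarizes the path
        return self.head(h[-1]).squeeze(-1)

model = TrajectoryClassifier()
traj = torch.randn(8, 200, 3)              # 8 trajectories, 200 frames of (x, y, z)
labels = torch.randint(0, 2, (8,)).float() # 1 = human demonstration, 0 = agent
loss = nn.functional.binary_cross_entropy_with_logits(model(traj), labels)
print(loss.item())
```

Once trained against human NTT judgments, a model like this can score new agent behavior at scale, which is the speed and scalability gain the abstract describes.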