Goto

Collaborating Authors

 Foley, John


Toybox: A Suite of Environments for Experimental Evaluation of Deep Reinforcement Learning

arXiv.org Machine Learning

Evaluation of deep reinforcement learning (RL) is inherently challenging. In particular, learned policies are largely opaque, and hypotheses about the behavior of deep RL agents are difficult to test in black-box environments. Considerable effort has gone into addressing opacity, but almost no effort has been devoted to producing high-quality environments for experimental evaluation of agent behavior. While ALE has enabled demonstration and evaluation of much more complex behaviors of deep RL agents, it presents challenges as a suite of evaluation environments for topics on the frontier of deep RL. Challenge: limited variation within games. Very little about individual games can be systematically altered, so ALE is poorly suited to testing how changes in the environment affect training and performance. New benchmarks such as OpenAI's Sonic the Hedgehog emulator and CoinRun inject environmental variation into the training schedule.


Let's Play Again: Variability of Deep Reinforcement Learning Agents in Atari Environments

arXiv.org Artificial Intelligence

Reproducibility in reinforcement learning is challenging: uncontrolled stochasticity from many sources, such as the learning algorithm, the learned policy, and the environment itself, has led researchers to report the performance of learned agents using aggregate metrics over multiple random seeds for a single environment. Unfortunately, there are still pernicious sources of variability in reinforcement learning agents that make reporting common summary statistics an unsound metric for performance. Our experiments demonstrate the variability of common agents used in the popular OpenAI Baselines repository. We make the case for reporting post-training agent performance as a distribution, rather than a point estimate.
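The practical consequence of the abstract's argument can be sketched in a few lines: instead of collapsing post-training episode returns into a single mean, report the distribution's quartiles and range. This is a minimal illustration using only the Python standard library; the simulated returns are hypothetical data, not results from the paper.

```python
import random
import statistics

def summarize_returns(returns):
    """Report post-training performance as a distribution
    (quartiles and range) rather than a single point estimate."""
    q1, median, q3 = statistics.quantiles(returns, n=4)  # quartile cut points
    return {
        "min": min(returns),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(returns),
    }

# Hypothetical per-episode returns from one trained agent.
random.seed(0)
returns = [random.gauss(200, 50) for _ in range(100)]
summary = summarize_returns(returns)
```

A single mean over these episodes would hide exactly the spread that `summary` exposes, which is the variability the paper argues summary statistics conceal.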


ToyBox: Better Atari Environments for Testing Reinforcement Learning Agents

arXiv.org Artificial Intelligence

It is a widely accepted principle that software without tests has bugs. Testing reinforcement learning agents is especially difficult because of the stochastic nature of both agents and environments, the complexity of state-of-the-art models, and the sequential nature of their predictions. Recently, the Arcade Learning Environment (ALE) has become one of the most widely used benchmark suites for deep learning research, and state-of-the-art Reinforcement Learning (RL) agents have been shown to routinely equal or exceed human performance on many ALE tasks. Since ALE is based on emulation of original Atari games, the environment does not provide semantically meaningful representations of internal game state. This means that ALE has limited utility as an environment for supporting testing or model introspection. We propose TOYBOX, a collection of reimplementations of these games that solves this critical problem and enables robust testing of RL agents.
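To illustrate why semantically meaningful state enables testing in a way that emulated pixels cannot, here is a hedged sketch of the kind of invariant check the abstract motivates. The `BreakoutState` class, its fields, and `apply_ball_drop` are all illustrative stand-ins, not the actual TOYBOX API.

```python
from dataclasses import dataclass, replace

# Hypothetical semantic game state; field names are illustrative only.
@dataclass(frozen=True)
class BreakoutState:
    lives: int
    bricks_remaining: int
    score: int

def apply_ball_drop(state: BreakoutState) -> BreakoutState:
    """Model the semantic effect of the agent missing the ball:
    one life is lost, score and bricks are unchanged."""
    return replace(state, lives=state.lives - 1)

def test_life_lost_preserves_score():
    # With structured state, invariants are directly assertable;
    # a pixel-only emulator offers no handle for such a test.
    before = BreakoutState(lives=3, bricks_remaining=10, score=40)
    after = apply_ball_drop(before)
    assert after.lives == before.lives - 1
    assert after.score == before.score
    assert after.bricks_remaining == before.bricks_remaining
```

The design point is that the test asserts on game semantics (lives, score), which is exactly what an emulator-based environment like ALE does not expose.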