Harley, Tim
Scaling Instructable Agents Across Many Simulated Worlds
SIMA Team, null, Raad, Maria Abi, Ahuja, Arun, Barros, Catarina, Besse, Frederic, Bolt, Andrew, Bolton, Adrian, Brownfield, Bethanie, Buttimore, Gavin, Cant, Max, Chakera, Sarah, Chan, Stephanie C. Y., Clune, Jeff, Collister, Adrian, Copeman, Vikki, Cullum, Alex, Dasgupta, Ishita, de Cesare, Dario, Di Trapani, Julia, Donchev, Yani, Dunleavy, Emma, Engelcke, Martin, Faulkner, Ryan, Garcia, Frankie, Gbadamosi, Charles, Gong, Zhitao, Gonzales, Lucy, Gupta, Kshitij, Gregor, Karol, Hallingstad, Arne Olav, Harley, Tim, Haves, Sam, Hill, Felix, Hirst, Ed, Hudson, Drew A., Hudson, Jony, Hughes-Fitt, Steph, Rezende, Danilo J., Jasarevic, Mimi, Kampis, Laura, Ke, Rosemary, Keck, Thomas, Kim, Junkyung, Knagg, Oscar, Kopparapu, Kavya, Lampinen, Andrew, Legg, Shane, Lerchner, Alexander, Limont, Marjorie, Liu, Yulan, Loks-Thompson, Maria, Marino, Joseph, Cussons, Kathryn Martin, Matthey, Loic, Mcloughlin, Siobhan, Mendolicchio, Piermaria, Merzic, Hamza, Mitenkova, Anna, Moufarek, Alexandre, Oliveira, Valeria, Oliveira, Yanko, Openshaw, Hannah, Pan, Renke, Pappu, Aneesh, Platonov, Alex, Purkiss, Ollie, Reichert, David, Reid, John, Richemond, Pierre Harvey, Roberts, Tyson, Ruscoe, Giles, Elias, Jaume Sanchez, Sandars, Tasha, Sawyer, Daniel P., Scholtes, Tim, Simmons, Guy, Slater, Daniel, Soyer, Hubert, Strathmann, Heiko, Stys, Peter, Tam, Allison C., Teplyashin, Denis, Terzi, Tayfun, Vercelli, Davide, Vujatovic, Bojan, Wainwright, Marcus, Wang, Jane X., Wang, Zhengdong, Wierstra, Daan, Williams, Duncan, Wong, Nathaniel, York, Sarah, Young, Nick
Building embodied AI systems that can follow arbitrary language instructions in any 3D environment is a key challenge for creating general AI. Accomplishing this goal requires learning to ground language in perception and embodied actions, in order to accomplish complex tasks. The Scalable, Instructable, Multiworld Agent (SIMA) project tackles this by training agents to follow free-form instructions across a diverse range of virtual 3D environments, including curated research environments as well as open-ended, commercial video games. Our goal is to develop an instructable agent that can accomplish anything a human can do in any simulated 3D environment. Our approach focuses on language-driven generality while imposing minimal assumptions. Our agents interact with environments in real-time using a generic, human-like interface: the inputs are image observations and language instructions and the outputs are keyboard-and-mouse actions. This general approach is challenging, but it allows agents to ground language across many visually complex and semantically rich environments while also allowing us to readily run agents in new environments. In this paper we describe our motivation and goal, the initial progress we have made, and promising preliminary results on several diverse research environments and a variety of commercial video games.
Imitating Interactive Intelligence
Abramson, Josh, Ahuja, Arun, Brussee, Arthur, Carnevale, Federico, Cassin, Mary, Clark, Stephen, Dudzik, Andrew, Georgiev, Petko, Guy, Aurelia, Harley, Tim, Hill, Felix, Hung, Alden, Kenton, Zachary, Landon, Jessica, Lillicrap, Timothy, Mathewson, Kory, Muldal, Alistair, Santoro, Adam, Savinov, Nikolay, Varma, Vikrant, Wayne, Greg, Wong, Nathaniel, Yan, Chen, Zhu, Rui
A common vision from science fiction is that robots will one day inhabit our physical spaces, sense the world as we do, assist our physical labours, and communicate with us through natural language. Here we study how to design artificial agents that can interact naturally with humans using the simplification of a virtual environment. This setting nevertheless integrates a number of the central challenges of artificial intelligence (AI) research: complex visual perception and goal-directed physical control, grounded language comprehension and production, and multi-agent social interaction. To build agents that can robustly interact with humans, we would ideally train them while they interact with humans. However, this is presently impractical. Therefore, we approximate the role of the human with another learned agent, and use ideas from inverse reinforcement learning to reduce the disparities between human-human and agent-agent interactive behaviour. Rigorously evaluating our agents poses a great challenge, so we develop a variety of behavioural tests, including evaluation by humans who watch videos of agents or interact directly with them. These evaluations convincingly demonstrate that interactive training and auxiliary losses improve agent behaviour beyond what is achieved by supervised learning of actions alone. Further, we demonstrate that agent capabilities generalise beyond literal experiences in the dataset. Finally, we train evaluation models whose ratings of agents agree well with human judgement, thus permitting the evaluation of new agent models without additional effort. Taken together, our results in this virtual environment provide evidence that large-scale human behavioural imitation is a promising tool to create intelligent, interactive agents, and the challenge of reliably evaluating such agents is possible to surmount.
A Generalized Framework for Population Based Training
Li, Ang, Spyra, Ola, Perel, Sagi, Dalibard, Valentin, Jaderberg, Max, Gu, Chenjie, Budden, David, Harley, Tim, Gupta, Pramod
Population Based Training (PBT) is a recent approach that jointly optimizes neural network weights and hyperparameters which periodically copies weights of the best performers and mutates hyperparameters during training. Previous PBT implementations have been synchronized glass-box systems. We propose a general, black-box PBT framework that distributes many asynchronous "trials" (a small number of training steps with warm-starting) across a cluster, coordinated by the PBT controller. The black-box design does not make assumptions on model architectures, loss functions or training procedures. Our system supports dynamic hyperparameter schedules to optimize both differentiable and non-differentiable metrics. We apply our system to train a state-of-the-art WaveNet generative model for human voice synthesis. We show that our PBT system achieves better accuracy, less sensitivity and faster convergence compared to existing methods, given the same computational resource.
IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures
Espeholt, Lasse, Soyer, Hubert, Munos, Remi, Simonyan, Karen, Mnih, Volodymir, Ward, Tom, Doron, Yotam, Firoiu, Vlad, Harley, Tim, Dunning, Iain, Legg, Shane, Kavukcuoglu, Koray
In this work we aim to solve a large collection of tasks using a single reinforcement learning agent with a single set of parameters. A key challenge is to handle the increased amount of data and extended training time. We have developed a new distributed agent IMPALA (Importance Weighted Actor-Learner Architecture) that not only uses resources more efficiently in single-machine training but also scales to thousands of machines without sacrificing data efficiency or resource utilisation. We achieve stable learning at high throughput by combining decoupled acting and learning with a novel off-policy correction method called V-trace. We demonstrate the effectiveness of IMPALA for multi-task reinforcement learning on DMLab-30 (a set of 30 tasks from the DeepMind Lab environment (Beattie et al., 2016)) and Atari-57 (all available Atari games in Arcade Learning Environment (Bellemare et al., 2013a)). Our results show that IMPALA is able to achieve better performance than previous agents with less data, and crucially exhibits positive transfer between tasks as a result of its multi-task approach.
Unsupervised Predictive Memory in a Goal-Directed Agent
Wayne, Greg, Hung, Chia-Chun, Amos, David, Mirza, Mehdi, Ahuja, Arun, Grabska-Barwinska, Agnieszka, Rae, Jack, Mirowski, Piotr, Leibo, Joel Z., Santoro, Adam, Gemici, Mevlana, Reynolds, Malcolm, Harley, Tim, Abramson, Josh, Mohamed, Shakir, Rezende, Danilo, Saxton, David, Cain, Adam, Hillier, Chloe, Silver, David, Kavukcuoglu, Koray, Botvinick, Matt, Hassabis, Demis, Lillicrap, Timothy
Animals execute goal-directed behaviours despite the limited range and scope of their sensors. To cope, they explore environments and store memories maintaining estimates of important information that is not presently available. Recently, progress has been made with artificial intelligence (AI) agents that learn to perform tasks from sensory input, even at a human level, by merging reinforcement learning (RL) algorithms with deep neural networks, and the excitement surrounding these results has led to the pursuit of related ideas as explanations of non-human animal learning. However, we demonstrate that contemporary RL algorithms struggle to solve simple tasks when enough information is concealed from the sensors of the agent, a property called "partial observability". An obvious requirement for handling partially observed tasks is access to extensive memory, but we show memory is not enough; it is critical that the right information be stored in the right format. We develop a model, the Memory, RL, and Inference Network (MERLIN), in which memory formation is guided by a process of predictive modeling. MERLIN facilitates the solution of tasks in 3D virtual reality environments for which partial observability is severe and memories must be maintained over long durations. Our model demonstrates a single learning agent architecture that can solve canonical behavioural tasks in psychology and neurobiology without strong simplifying assumptions about the dimensionality of sensory input or the duration of experiences.
The Predictron: End-To-End Learning and Planning
Silver, David, van Hasselt, Hado, Hessel, Matteo, Schaul, Tom, Guez, Arthur, Harley, Tim, Dulac-Arnold, Gabriel, Reichert, David, Rabinowitz, Neil, Barreto, Andre, Degris, Thomas
One of the key challenges of artificial intelligence is to learn models that are effective in the context of planning. In this document we introduce the predictron architecture. The predictron consists of a fully abstract model, represented by a Markov reward process, that can be rolled forward multiple "imagined" planning steps. Each forward pass of the predictron accumulates internal rewards and values over multiple planning depths. The predictron is trained end-to-end so as to make these accumulated values accurately approximate the true value function. We applied the predictron to procedurally generated random mazes and a simulator for the game of pool. The predictron yielded significantly more accurate predictions than conventional deep neural network architectures.