Goto

Collaborating Authors

 Reinforcement Learning


Emergent Tool Use From Multi-Agent Autocurricula

arXiv.org Artificial Intelligence

Through multi-agent competition, the simple objective of hide-and-seek, and standard reinforcement learning algorithms at scale, we find that agents create a self-supervised autocurriculum inducing multiple distinct rounds of emergent strategy, many of which require sophisticated tool use and coordination. We find clear evidence of six emergent phases in agent strategy in our environment, each of which creates a new pressure for the opposing team to adapt; for instance, agents learn to build multi-object shelters using moveable boxes which in turn leads to agents discovering that they can overcome obstacles using ramps. We further provide evidence that multi-agent competition may scale better with increasing environment complexity and leads to behavior that centers around far more human-relevant skills than other self-supervised reinforcement learning methods such as intrinsic motivation. Finally, we propose transfer and fine-tuning as a way to quantitatively evaluate targeted capabilities, and we compare hide-and-seek agents to both intrinsic motivation and random initialization baselines in a suite of domain-specific intelligence tests.


Control Synthesis from Linear Temporal Logic Specifications using Model-Free Reinforcement Learning

arXiv.org Artificial Intelligence

Arrows: actions top, left, down, and right; encircled characters: state labels. The actions in states that are not reachable or lead to another LDBA state are not displayed. In all subfigures, the most likely paths are highlighted in red. the baby b, the only allowed action is left and when taken the following situations can happen: (i) the robot hits the wall with probability 0.1 and wakes the baby up; (ii) the robot moves left with probability 0. 8 or moves down with probability 0.1 . If the baby has been woken up, which means the robot could not leave in a single time step (represented by L TL as b null b), the robot should notify the adult (at state a); otherwise, the robot should directly go back to the charger (at state c). The full objective is specified in L TL as ϕ 2 nullnull d nullnullnullnull (1) (b null b) null ( b U (a c)) null nullnull null (2) a null ( a U b) null nullnull null (3) ( b null b nullnull b) ( a U c) null nullnull null (4) c ( a U b) null nullnull null (5) (b null b) a null nullnull null (6) null .


Exploring Apprenticeship Learning for Player Modelling in Interactive Narratives

arXiv.org Artificial Intelligence

In this paper we present an early Apprenticeship Learning approach to mimic the behaviour of different players in a short adaption of the interactive fiction Anchorhead. Our motivation is the need to understand and simulate player behaviour to create systems to aid the design and person-alisation of Interactive Narratives (INs). INs are partially observable for the players and their goals are dynamic as a result. We used Receding Horizon IRL (RHIRL) to learn players' goals in the form of reward functions, and derive policies to imitate their behaviour. Our preliminary results suggest that RHIRL is able to learn action sequences to complete a game, and provided insights towards generating behaviour more similar to specific players.


Leveraging human Domain Knowledge to model an empirical Reward function for a Reinforcement Learning problem

arXiv.org Artificial Intelligence

Traditional Reinforcement Learning (RL) problems depend on an exhaustive simulation environment that models real-world physics of the problem and trains the RL agent by observing this environment. In this paper, we present a novel approach to creating an environment by modeling the reward function based on empirical rules extracted from human domain knowledge of the system under study. Using this empirical rewards function, we will build an environment and train the agent. We will first create an environment that emulates the effect of setting cabin temperature through thermostat. This is typically done in RL problems by creating an exhaustive model of the system with detailed thermodynamic study. Instead, we propose an empirical approach to model the reward function based on human domain knowledge. We will document some rules of thumb that we usually exercise as humans while setting thermostat temperature and try and model these into our reward function. This modeling of empirical human domain rules into a reward function for RL is the unique aspect of this paper. This is a continuous action space problem and using deep deterministic policy gradient (DDPG) method, we will solve for maximizing the reward function. We will create a policy network that predicts optimal temperature setpoint given external temperature and humidity.


Training a Goal-Oriented Chatbot with Deep Reinforcement Learning -- Part III

#artificialintelligence

The dialogue state tracker or just state tracker (ST) in a goal-oriented dialogue system has the primary job of preparing the state for the agent. As we discussed in the previous part the agent needs a useful state to be able to make a good choice on what action to take. The ST updates its internal history of the dialogue by collecting both user and agent actions as they are taken. It also keeps track of all inform slots that have been contained in any agent and user actions thus far in the current episode. The state used by the agent is a numpy array made of information from the current history and the current informs of the ST. In addition, whenever the agent wishes to inform a slot to the user the ST queries the database for a value that works given its current informs.


Practical Reinforcement Learning with TensorFlow 2.0 & TF-Agents

#artificialintelligence

Here is a short video of Orso's minimal path through his turf. Notice how he avoids troublesome water in favor of longer paths through grassy lands. In our Half Day Hands-on Training At ODSC West in San Francisco we will show you more details on how you can use reinforcement learning practically. Using Colab notebooks we will model problems as a simulation environment (Orso's world) and train your agent (Orso himself) to learn a good strategy.


Model Based Planning with Energy Based Models

arXiv.org Machine Learning

Model-based planning holds great promise for improving both sample efficiency and generalization in reinforcement learning (RL). We show that energy-based models (EBMs) are a promising class of models to use for model-based planning. EBMs naturally support inference of intermediate states given start and goal state distributions. We provide an online algorithm to train EBMs while interacting with the environment, and show that EBMs allow for significantly better online learning than corresponding feed-forward networks. We further show that EBMs support maximum entropy state inference and are able to generate diverse state space plans. We show that inference purely in state space - without planning actions - allows for better generalization to previously unseen obstacles in the environment and prevents the planner from exploiting the dynamics model by applying uncharacteristic action sequences. Finally, we show that online EBM training naturally leads to intentionally planned state exploration which performs significantly better than random exploration.


Biased Estimates of Advantages over Path Ensembles

arXiv.org Machine Learning

The estimation of advantage is crucial for a number of reinforcement learning algorithms, as it directly influences the choices of future paths. In this work, we propose a family of estimates based on the order statistics over the path ensemble, which allows one to flexibly drive the learning process, towards or against risks. On top of this formulation, we systematically study the impacts of different methods for estimating advantages. Our findings reveal that biased estimates, when chosen appropriately, can result in significant benefits. In particular, for the environments with sparse rewards, optimistic estimates would lead to more efficient exploration of the policy space; while for those where individual actions can have critical impacts, conservative estimates are preferable. On various benchmarks, including MuJoCo continuous control, Terrain locomotion, Atari games, and sparse-reward environments, the proposed biased estimation schemes consistently demonstrate improvement over mainstream methods, not only accelerating the learning process but also obtaining substantial performance gains.


Wield: Systematic Reinforcement Learning With Progressive Randomization

arXiv.org Machine Learning

Reinforcement learning frameworks have introduced abstractions to implement and execute algorithms at scale. They assume standardized simulator interfaces but are not concerned with identifying suitable task representations. We present Wield, a first-of-its kind system to facilitate task design for practical reinforcement learning. Through software primitives, Wield enables practitioners to decouple system-interface and deployment-specific configuration from state and action design. To guide experimentation, Wield further introduces a novel task design protocol and classification scheme centred around staged randomization to incrementally evaluate model capabilities.


VILD: Variational Imitation Learning with Diverse-quality Demonstrations

arXiv.org Machine Learning

The goal of imitation learning (IL) is to learn a good policy from high-quality demonstrations. However, the quality of demonstrations in reality can be diverse, since it is easier and cheaper to collect demonstrations from a mix of experts and amateurs. IL in such situations can be challenging, especially when the level of demonstrators' expertise is unknown. We propose a new IL method called \underline{v}ariational \underline{i}mitation \underline{l}earning with \underline{d}iverse-quality demonstrations (VILD), where we explicitly model the level of demonstrators' expertise with a probabilistic graphical model and estimate it along with a reward function. We show that a naive approach to estimation is not suitable to large state and action spaces, and fix its issues by using a variational approach which can be easily implemented using existing reinforcement learning methods. Experiments on continuous-control benchmarks demonstrate that VILD outperforms state-of-the-art methods. Our work enables scalable and data-efficient IL under more realistic settings than before.