Reinforcement Learning
RLgraph: Flexible Computation Graphs for Deep Reinforcement Learning
Schaarschmidt, Michael, Mika, Sven, Fricke, Kai, Yoneki, Eiko
Reinforcement learning (RL) tasks are challenging to implement, execute and test due to algorithmic instability, hyper-parameter sensitivity, and heterogeneous distributed communication patterns. We argue for the separation of logical component composition, backend graph definition, and distributed execution. To this end, we introduce RLgraph, a library for designing and executing high performance RL computation graphs in both static graph and define-by-run paradigms. The resulting implementations yield high performance across different deep learning frameworks and distributed backends.
The Building Blocks of Reinforcement Learning: DeepMind Open Sources TRFL
Deep reinforcement learning (DRL) has often been described as the future of artificial intelligence (AI). Some of the most important AI breakthroughs of the last few years, such as DeepMind's AlphaGo and OpenAI Five for Dota 2, have been based on DRL. Despite its importance, implementing DRL models remains an incredibly challenging exercise and, for the most part, we have very little idea which pieces make up an efficient DRL solution. Earlier this week, DeepMind open sourced TRFL (pronounced "truffle", of course), a framework that collects a series of useful building blocks for DRL models. Most of the current wave of DRL methods originated in academic environments and have not been tested in real-world implementations.
Supervising strong learners by amplifying weak experts
Christiano, Paul, Shlegeris, Buck, Amodei, Dario
Many real world learning tasks involve complex or hard-to-specify objectives, and using an easier-to-specify proxy can lead to poor performance or misaligned behavior. One solution is to have humans provide a training signal by demonstrating or judging performance, but this approach fails if the task is too complicated for a human to directly evaluate. We propose Iterated Amplification, an alternative training strategy which progressively builds up a training signal for difficult problems by combining solutions to easier subproblems. Iterated Amplification is closely related to Expert Iteration (Anthony et al., 2017; Silver et al., 2017b), except that it uses no external reward function. We present results in algorithmic environments, showing that Iterated Amplification can efficiently learn complex behaviors.
Transfer Learning versus Multi-agent Learning regarding Distributed Decision-Making in Highway Traffic
Schutera, Mark, Goby, Niklas, Neumann, Dirk, Reischl, Markus
Transportation and traffic are currently undergoing a rapid increase in terms of both scale and complexity. At the same time, an increasing share of traffic participants are being transformed into agents driven or supported by artificial intelligence resulting in mixed-intelligence traffic. This work explores the implications of distributed decision-making in mixed-intelligence traffic. The investigations are carried out on the basis of an online-simulated highway scenario, namely the MIT DeepTraffic simulation. In the first step traffic agents are trained by means of a deep reinforcement learning approach, being deployed inside an elitist evolutionary algorithm for hyperparameter search. The resulting architectures and training parameters are then utilized in order to either train a single autonomous traffic agent and transfer the learned weights onto a multi-agent scenario or else to conduct multi-agent learning directly. Both learning strategies are evaluated on different ratios of mixed-intelligence traffic. The strategies are assessed according to the average speed of all agents driven by artificial intelligence. Traffic patterns that provoke a reduction in traffic flow are analyzed with respect to the different strategies.
ProMP: Proximal Meta-Policy Search
Rothfuss, Jonas, Lee, Dennis, Clavera, Ignasi, Asfour, Tamim, Abbeel, Pieter
Credit assignment in Meta-reinforcement learning (Meta-RL) is still poorly understood. Existing methods either neglect credit assignment to pre-adaptation behavior or implement it naively. This leads to poor sample-efficiency during meta-training as well as ineffective task identification strategies. This paper provides a theoretical analysis of credit assignment in gradient-based Meta-RL. Building on the gained insights, we develop a novel meta-learning algorithm that overcomes both the issue of poor credit assignment and previous difficulties in estimating meta-policy gradients. By controlling the statistical distance of both pre-adaptation and adapted policies during meta-policy search, the proposed algorithm enables efficient and stable meta-learning. Our approach leads to superior pre-adaptation policy behavior and consistently outperforms previous Meta-RL algorithms in sample-efficiency, wall-clock time, and asymptotic performance.
Multi-Agent Fully Decentralized Off-Policy Learning with Linear Convergence Rates
Cassano, Lucas, Yuan, Kun, Sayed, Ali H.
In this paper we develop a fully decentralized algorithm for policy evaluation with off-policy learning, linear function approximation, and $O(n)$ complexity in both computation and memory requirements. The proposed algorithm is of the variance-reduced kind and achieves linear convergence. We consider the case where a collection of agents have distinct and fixed-size datasets gathered following different behavior policies (none of which is required to explore the full state space) and they all collaborate to evaluate a common target policy. The network approach allows all agents to converge to the optimal solution even in situations where no agent can converge on its own without cooperation. We provide simulations to illustrate the effectiveness of the method.
Holodeck - High Fidelity Simulator for Reinforcement Learning and Robotics Research
This is the first public release of Holodeck, a high-fidelity simulator built on top of Unreal Engine 4 (UE4). Holodeck is a Python package that can be used for research, classes, or even fun. It lets users download pre-built worlds and interact with them through a simple, high-level interface. At present, the release comprises a simple sphere robot, a UAV (quadcopter), an Android, and a navigation agent. It also comes with 6 diverse default worlds. On what principles is Holodeck built?
Integrating kinematics and environment context into deep inverse reinforcement learning for predicting off-road vehicle trajectories
Zhang, Yanfu, Wang, Wenshan, Bonatti, Rogerio, Maturana, Daniel, Scherer, Sebastian
Predicting the motion of a mobile agent from a third-person perspective is an important component for many robotics applications, such as autonomous navigation and tracking. With accurate motion prediction of other agents, robots can plan for more intelligent behaviors to achieve specified objectives, instead of acting in a purely reactive way. Previous work addresses motion prediction by either only filtering kinematics, or using hand-designed and learned representations of the environment. Instead of separating kinematic and environmental context, we propose a novel approach to integrate both into an inverse reinforcement learning (IRL) framework for trajectory prediction. Instead of exponentially increasing the state-space complexity with kinematics, we propose a two-stage neural network architecture that considers motion and environment together to recover the reward function. The first-stage network learns feature representations of the environment using low-level LiDAR statistics and the second-stage network combines those learned features with kinematics data. We collected over 30 km of off-road driving data and validated experimentally that our method can effectively extract useful environmental and kinematic features. We generate accurate predictions of the distribution of future trajectories of the vehicle, encoding complex behaviors such as multi-modal distributions at road intersections, and even show different predictions at the same intersection depending on the vehicle's speed.
Simple Regret Minimization for Contextual Bandits
Deshmukh, Aniket Anand, Sharma, Srinagesh, Cutler, James W., Moldwin, Mark, Scott, Clayton
There are two variants of the classical multi-armed bandit (MAB) problem that have received considerable attention from machine learning researchers in recent years: contextual bandits and simple regret minimization. Contextual bandits are a sub-class of MABs where, at every time step, the learner has access to side information that is predictive of the best arm. Simple regret minimization assumes that the learner only incurs regret after a pure exploration phase. In this work, we study simple regret minimization for contextual bandits. Motivated by applications where the learner has separate training and autonomous modes, we assume that the learner experiences a pure exploration phase, where feedback is received after every action but no regret is incurred, followed by a pure exploitation phase in which regret is incurred but there is no feedback. We present the Contextual-Gap algorithm and establish performance guarantees on the simple regret, i.e., the regret during the pure exploitation phase. Our experiments examine a novel application to adaptive sensor selection for magnetic field estimation in interplanetary spacecraft, and demonstrate considerable improvement over algorithms designed to minimize the cumulative regret.
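The two-phase protocol described above can be illustrated with a minimal Python sketch. This is not the Contextual-Gap algorithm: exploration here is plain round-robin, and the reward table `TRUE_MEANS`, the two discrete contexts, and the function names `explore` and `simple_regret` are all assumptions made for illustration. It only shows where feedback is observed and where regret is charged.

```python
import random

# Hypothetical mean rewards, indexed as TRUE_MEANS[context][arm].
TRUE_MEANS = [[0.2, 0.8], [0.9, 0.1]]


def explore(n_rounds, noise, rng):
    """Pure exploration: feedback after every pull, no regret is charged."""
    n_ctx, n_arms = len(TRUE_MEANS), len(TRUE_MEANS[0])
    sums = [[0.0] * n_arms for _ in range(n_ctx)]
    counts = [[0] * n_arms for _ in range(n_ctx)]
    for t in range(n_rounds):
        # Assumption: arms and contexts are cycled deterministically
        # (round-robin); an adaptive rule would choose arms differently.
        arm = t % n_arms
        ctx = (t // n_arms) % n_ctx
        reward = TRUE_MEANS[ctx][arm] + rng.gauss(0.0, noise)
        sums[ctx][arm] += reward
        counts[ctx][arm] += 1
    # Empirical mean reward per (context, arm); 0.0 if a cell was never pulled.
    return [[s / max(c, 1) for s, c in zip(s_row, c_row)]
            for s_row, c_row in zip(sums, counts)]


def simple_regret(estimates):
    """Pure exploitation: commit to the greedy arm per context, no feedback.

    Returns the gap between the best arm's true mean and the chosen arm's
    true mean, averaged over contexts.
    """
    total = 0.0
    for ctx, est_row in enumerate(estimates):
        chosen = max(range(len(est_row)), key=est_row.__getitem__)
        total += max(TRUE_MEANS[ctx]) - TRUE_MEANS[ctx][chosen]
    return total / len(estimates)


if __name__ == "__main__":
    estimates = explore(n_rounds=400, noise=0.5, rng=random.Random(0))
    print(simple_regret(estimates))
```

Rewards collected inside `explore` never enter the regret; only the arms committed to in `simple_regret` do, which is the distinction from cumulative regret.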
At Human Speed: Deep Reinforcement Learning with Action Delay
Firoiu, Vlad, Ju, Tina, Tenenbaum, Josh
There has been a recent explosion in the capabilities of game-playing artificial intelligence. Many classes of tasks, from video games to motor control to board games, are now solvable by fairly generic algorithms, based on deep learning and reinforcement learning, that learn to play from experience with minimal prior knowledge. However, these machines often do not win through intelligence alone -- they possess vastly superior speed and precision, allowing them to act in ways a human never could. To level the playing field, we restrict the machine's reaction time to a human level, and find that standard deep reinforcement learning methods quickly drop in performance. We propose a solution to the action delay problem inspired by human perception -- to endow agents with a neural predictive model of the environment which "undoes" the delay inherent in their environment -- and demonstrate its efficacy against professional players in Super Smash Bros. Melee, a popular console fighting game.
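The action delay setting can be sketched as an environment wrapper that queues actions and executes each one a fixed number of steps late. This is a minimal illustration under stated assumptions, not the authors' implementation: the `reset`/`step` interface, the `DelayedActionEnv` and `CounterEnv` classes, and the no-op action are all hypothetical, and the predictive model proposed in the abstract is not shown.

```python
from collections import deque


class DelayedActionEnv:
    """Wraps an environment so each action takes effect `delay` steps later."""

    def __init__(self, env, delay, noop_action):
        self.env = env
        self.delay = delay
        self.noop_action = noop_action
        self.pending = deque()

    def reset(self):
        # Pre-fill the queue with no-ops: nothing the agent chooses
        # can take effect during the first `delay` steps.
        self.pending = deque([self.noop_action] * self.delay)
        return self.env.reset()

    def step(self, action):
        self.pending.append(action)
        # Execute the action that was chosen `delay` steps ago.
        return self.env.step(self.pending.popleft())


class CounterEnv:
    """Toy environment (hypothetical): the state is a running sum of actions."""

    def reset(self):
        self.total = 0
        return self.total

    def step(self, action):
        self.total += action
        return self.total


def run_delayed(actions, delay):
    """Feed `actions` through the delayed wrapper and collect the states."""
    env = DelayedActionEnv(CounterEnv(), delay=delay, noop_action=0)
    env.reset()
    return [env.step(a) for a in actions]


if __name__ == "__main__":
    print(run_delayed([1, 2, 3, 4], delay=2))  # prints [0, 0, 1, 3]
```

With `delay=2`, the first two states stay at 0 because only the queued no-ops have been executed, so the agent must choose each action without seeing its effect for two further steps.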