Reinforcement Learning
Learning Curriculum Policies for Reinforcement Learning
Narvekar, Sanmit, Stone, Peter
Curriculum learning in reinforcement learning is a training methodology that seeks to speed up learning of a difficult target task by first training on a series of simpler tasks and transferring the knowledge acquired to the target task. Automatically choosing a sequence of such tasks (i.e. a curriculum) is an open problem that has been the subject of much recent work in this area. In this paper, we build upon a recent method for curriculum design, which formulates the curriculum sequencing problem as a Markov Decision Process. We extend this model to handle multiple transfer learning algorithms, and show for the first time that a curriculum policy over this MDP can be learned from experience. We explore various representations that make this possible, and evaluate our approach by learning curriculum policies for multiple agents in two different domains. The results show that our method produces curricula that can train agents to perform on a target task as fast as or faster than existing methods.
Discovering hierarchies using Imitation Learning from hierarchy aware policies
Deshpande, Ameet, K, Harshavardhan P, Ravindran, Balaraman
Learning options that allow agents to exhibit temporally higher order behavior has proven to be useful in increasing exploration, reducing sample complexity and for various transfer scenarios. Deep Discovery of Options (DDO) is a generative algorithm that learns a hierarchical policy along with options directly from expert trajectories. We perform a qualitative and quantitative analysis of options inferred from DDO in different domains. To this end, we suggest different value metrics, such as the option termination condition, the hinge value function error and a KL-divergence based distance metric, to compare different methods. Analyzing the termination condition of the options and the number of time steps the options were run revealed that the options were terminating prematurely. We suggest modifications that can be incorporated easily and that alleviate the problems of shorter options and the collapse of options to the same mode.
Amazon tempts developers to machine learning with toy race car
The DeepRacer includes a built-in compute section featuring an Intel Atom processor, 4 GB of RAM and 32 GB of internal storage, and comes loaded with Ubuntu OS, the Intel OpenVINO computer vision toolkit and ROS Kinetic (Robot Operating System). It has been designed to get developers into reinforcement learning, a form of machine learning that uses trial and error to achieve goals and successful outcomes.
Control with Distributed Deep Reinforcement Learning: Learn a Better Policy
Liu, Qihao, Liu, Xiaofeng, Cai, Guoping
A distributed approach is an effective way to improve the training efficiency of reinforcement learning. In this paper, we propose a new heuristic distributed architecture for deep reinforcement learning (DRL) algorithms in which, besides using multiple agents for parallel training, a PSO-based network update mechanism is adopted to speed up learning of an optimal policy. In this mechanism, each agent's neural network is updated not only according to its own training result but also under the influence of the best-performing network among all agents. To verify the effectiveness of the proposed method, the architecture is implemented on the Deep Q-Network (DQN) and Deep Deterministic Policy Gradient (DDPG) algorithms and trained on several typical control problems. The training results show that the proposed method is effective. Reinforcement learning is about an agent interacting with the environment, learning an optimal policy by trial and error.
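The update rule described above can be sketched roughly as follows. This is not the authors' implementation: the abstract gives no formula, so the sketch assumes PSO means particle swarm optimization and reduces it to a single attraction term toward the best agent's parameters. The function name `pso_style_update` and the parameters `lr` and `pull` are hypothetical.

```python
import numpy as np

def pso_style_update(params, local_grad, best_params, rng, lr=0.01, pull=0.1):
    """One hypothetical update of an agent's flat parameter vector.

    Assumption: the agent first takes its own gradient step, then is
    attracted toward the best agent's parameters by a random,
    per-coordinate fraction, as in a PSO "social" term.
    """
    # Step driven by the agent's own training result.
    stepped = params - lr * local_grad
    # Random attraction toward the best-performing agent's parameters.
    r = rng.random(params.shape)
    return stepped + pull * r * (best_params - stepped)

rng = np.random.default_rng(0)
agent = np.array([1.0, -2.0])
grad = np.array([0.5, 0.5])
best = np.array([3.0, 3.0])
updated = pso_style_update(agent, grad, best, rng, lr=0.1, pull=0.5)
```

With `pull` between 0 and 1, each coordinate of the result lies between the agent's own gradient step and the best agent's value, so the best agent itself is left unchanged when its gradient is zero.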
Modeling natural language emergence with integral transform theory and reinforcement learning
Khomtchouk, Bohdan, Sudhakaran, Shyam
Zipf's law predicts a power-law relationship between word rank and frequency in language communication systems and has been widely reported in a variety of natural language processing applications. However, the emergence of natural language is often modeled as a function of bias between speaker and listener interests, which lacks a direct way of relating information-theoretic bias to Zipfian rank. A function of bias also serves as an unintuitive interpretation of the communicative effort exchanged between a speaker and a listener. We counter these shortcomings by proposing a novel integral transform and kernel for mapping communicative bias functions to corresponding word frequency-rank representations at any arbitrary phase transition point, resulting in a direct way to link communicative effort (modeled by speaker/listener bias) to specific vocabulary used (represented by word rank). We demonstrate the practical utility of our integral transform by showing how a change from bias to rank results in greater accuracy and performance at an image classification task for assigning word labels to images randomly subsampled from CIFAR10. We model this task as a reinforcement learning game between a speaker and listener and compare the relative impact of bias and Zipfian word rank on communicative performance (and accuracy) between the two agents.
BlockPuzzle - A Challenge in Physical Reasoning and Generalization for Robot Learning
In this work we propose a novel task framework under which a variety of physical reasoning puzzles can be constructed using very simple rules. Under sparse reward settings, most of these tasks can be very challenging for a reinforcement learning agent to learn. We build several simple environments with this task framework in MuJoCo and OpenAI Gym and attempt to solve them. We are able to solve the environments by designing curricula to guide the agent in learning and using imitation learning methods to transfer knowledge from a simpler environment. This is only a first step for the task framework, and further research on how to solve the harder tasks and transfer knowledge between tasks is needed.
Using Monte Carlo Tree Search as a Demonstrator within Asynchronous Deep RL
Kartal, Bilal, Hernandez-Leal, Pablo, Taylor, Matthew E.
Deep reinforcement learning (DRL) has achieved great successes in recent years with the help of novel methods and higher compute power. However, there are still several challenges to be addressed, such as convergence to locally optimal policies and long training times. In this paper, we first augment the Asynchronous Advantage Actor-Critic (A3C) method with a novel self-supervised auxiliary task, Terminal Prediction, which measures temporal closeness to terminal states; we call the resulting method A3C-TP. Second, we propose a new framework in which planning algorithms such as Monte Carlo tree search or other sources of (simulated) demonstrators can be integrated into asynchronous distributed DRL methods. Compared to vanilla A3C, both of our proposed methods learn faster and converge to better policies on a two-player mini version of the Pommerman game.
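As a rough illustration of a "temporal closeness to terminal states" signal, the sketch below builds self-supervised targets for one finished episode. The abstract does not define the target, so the linear form t/N (step index over episode length) is an assumption, and both function names are hypothetical.

```python
import numpy as np

def terminal_prediction_targets(episode_length):
    """Assumed targets: rise linearly and reach 1.0 at the terminal step."""
    steps = np.arange(1, episode_length + 1)
    return steps / episode_length

def terminal_prediction_loss(predictions, episode_length):
    """Mean squared error between predicted and assumed closeness values."""
    targets = terminal_prediction_targets(episode_length)
    return float(np.mean((np.asarray(predictions, dtype=float) - targets) ** 2))

targets = terminal_prediction_targets(4)
perfect_loss = terminal_prediction_loss(targets, 4)
```

Because the labels come from the episode itself, they can only be computed once the episode has ended; in an actor-critic setup such a loss would typically be added to the policy and value losses with a small weight.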
Flow Shape Design for Microfluidic Devices Using Deep Reinforcement Learning
Lee, Xian Yeow, Balu, Aditya, Stoecklein, Daniel, Ganapathysubramanian, Baskar, Sarkar, Soumik
Microfluidic devices are utilized to control and direct flow behavior in a wide variety of applications, particularly in medical diagnostics. A particularly popular form of microfluidics -- called inertial microfluidic flow sculpting -- involves placing a sequence of pillars to controllably deform an initial flow field into a desired one. Inertial flow sculpting can be formally defined as an inverse problem, where one identifies a sequence of pillars (chosen, with replacement, from a finite set of pillars, each of which produces a specific transformation) whose composite transformation results in a user-defined desired transformation. Like most such problems in engineering, these inverse problems are usually computationally intractable, and most traditional approaches are based on search and optimization strategies. In this paper, we pose this inverse problem as a Reinforcement Learning (RL) problem. We train a DoubleDQN agent to learn from this environment. The results suggest that learning is possible using a DoubleDQN model, with the success frequency reaching 90% in 200,000 episodes and the rewards converging. While most of the results are obtained by fixing a particular target flow shape to simplify the learning problem, we later demonstrate how to transfer the learning of an agent from one target shape to another, i.e. from one design to another, which makes the approach useful for generic flow shape design.
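For readers unfamiliar with DoubleDQN, the sketch below shows the standard Double DQN bootstrap target: the online network selects the next action and the target network evaluates it. The abstract gives no details of the authors' networks or reward, so the arrays here are placeholder values and the function name is hypothetical.

```python
import numpy as np

def double_dqn_targets(rewards, dones, q_online_next, q_target_next, gamma=0.99):
    """Standard Double DQN targets for a batch of transitions.

    q_online_next, q_target_next: arrays of shape (batch, n_actions) with
    the Q-values of the next states under the online and target networks.
    """
    # Action selection uses the online network...
    best_actions = np.argmax(q_online_next, axis=1)
    # ...while action evaluation uses the target network.
    next_values = q_target_next[np.arange(len(best_actions)), best_actions]
    # Terminal transitions (dones == 1) do not bootstrap.
    return rewards + gamma * (1.0 - dones) * next_values

rewards = np.array([1.0, 0.0])
dones = np.array([0.0, 1.0])
q_online_next = np.array([[1.0, 2.0], [3.0, 0.0]])
q_target_next = np.array([[10.0, 20.0], [30.0, 40.0]])
targets = double_dqn_targets(rewards, dones, q_online_next, q_target_next, gamma=0.5)
```

Splitting selection and evaluation between two networks is what distinguishes Double DQN from plain DQN, which takes the maximum of the target network directly and tends to overestimate values.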
How to Organize your Deep Reinforcement Learning Agents: The Importance of Communication Topology
Adjodah, Dhaval, Calacci, Dan, Dubey, Abhimanyu, Krafft, Peter, Moro, Esteban, Pentland, Alex 'Sandy'
In this empirical paper, we investigate how learning agents can be arranged in more efficient communication topologies for improved learning. This is an important problem because a common technique to improve speed and robustness of learning in deep reinforcement learning and many other machine learning algorithms is to run multiple learning agents in parallel. The standard communication architecture typically involves all agents intermittently communicating with each other (fully connected topology) or with a centralized server (star topology). Unfortunately, optimizing the topology of communication over the space of all possible graphs is a hard problem, so we borrow results from the networked optimization and collective intelligence literatures which suggest that certain families of network topologies can lead to strong improvements over fully-connected networks. We start by introducing alternative network topologies to DRL benchmark tasks under the Evolution Strategies paradigm, an approach we call Network Evolution Strategies. We explore the relative performance of the four main graph families and observe that one such family (Erdos-Renyi random graphs) empirically outperforms all other families, including the de facto fully-connected communication topologies. Additionally, the use of alternative network topologies has a multiplicative performance effect: we observe that when 1000 learning agents are arranged in a carefully designed communication topology, they can compete with 3000 agents arranged in the de facto fully-connected topology. Overall, our work suggests that distributed machine learning algorithms would learn more efficiently if the communication topology between learning agents were optimized.
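To make the two topologies concrete, the sketch below builds agent-to-agent adjacency matrices for an Erdos-Renyi random graph and for the fully-connected baseline. It only illustrates the graph structures named in the abstract, not the Network Evolution Strategies algorithm; the function names and the `edge_prob` parameter are assumptions.

```python
import numpy as np

def erdos_renyi_adjacency(n_agents, edge_prob, rng):
    """Symmetric boolean adjacency matrix without self-loops.

    Each undirected edge is included independently with probability
    edge_prob. The resulting graph is not guaranteed to be connected.
    """
    upper = np.triu(rng.random((n_agents, n_agents)) < edge_prob, k=1)
    return upper | upper.T

def fully_connected_adjacency(n_agents):
    """Every agent communicates with every other agent."""
    return ~np.eye(n_agents, dtype=bool)

rng = np.random.default_rng(0)
sparse_topology = erdos_renyi_adjacency(10, 0.3, rng)
dense_topology = fully_connected_adjacency(10)
# Neighbors that agent 0 would exchange information with:
neighbors_of_first = np.flatnonzero(sparse_topology[0])
```

In a distributed learner, each agent would then share parameters or returns only with the agents listed in its row of the adjacency matrix.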
Transition-based versus State-based Reward Functions for MDPs with Value-at-Risk
In reinforcement learning, a reward function defined on the current state and action is widely used. When the objective concerns only the expectation of the (discounted) total reward, this works perfectly. However, if the objective involves the distribution of the total reward, the result will be wrong. This paper studies Value-at-Risk (VaR) problems in short- and long-horizon Markov decision processes (MDPs) with two reward functions that share the same expectations. First, we show that with a VaR objective, when the real reward function is transition-based (with respect to the action and both the current and next states), the simplified reward function (state-based, with respect to the action and current state only) will change the VaR. Second, for long-horizon MDPs, we estimate the VaR function with the aid of spectral theory and the central limit theorem. Third, since the estimation method is for a Markov reward process with the reward function on the current state only, we present a transformation algorithm for the Markov reward process with the reward function on current and next states, in order to estimate the VaR function with an intact total reward distribution.
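A toy one-step example shows why the simplification matters. The numbers below are invented for illustration and are not from the paper, and VaR is taken here as the lower alpha-quantile of the total reward, a convention assumed for this sketch.

```python
import numpy as np

def value_at_risk(outcomes, probs, alpha):
    """Smallest outcome x such that P(reward <= x) >= alpha."""
    outcomes = np.asarray(outcomes, dtype=float)
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(outcomes)
    cdf = np.cumsum(probs[order])
    # First index whose cumulative probability reaches alpha.
    idx = min(int(np.searchsorted(cdf, alpha, side="left")), len(cdf) - 1)
    return float(outcomes[order][idx])

# Transition-based reward: 0 or 2 depending on the next state (50/50).
transition_outcomes, transition_probs = [0.0, 2.0], [0.5, 0.5]
# State-based simplification: the expected reward, received with certainty.
state_outcomes, state_probs = [1.0], [1.0]

mean_transition = float(np.dot(transition_outcomes, transition_probs))
mean_state = float(np.dot(state_outcomes, state_probs))
var_transition = value_at_risk(transition_outcomes, transition_probs, alpha=0.25)
var_state = value_at_risk(state_outcomes, state_probs, alpha=0.25)
```

Both reward functions have expectation 1.0, yet at alpha = 0.25 the transition-based VaR is 0.0 while the state-based VaR is 1.0: averaging over next states preserves the mean but collapses the distribution.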