Reinforcement Learning
ACTRCE: Augmenting Experience via Teacher's Advice For Multi-Goal Reinforcement Learning
Chan, Harris, Wu, Yuhuai, Kiros, Jamie, Fidler, Sanja, Ba, Jimmy
Sparse reward is one of the most challenging problems in reinforcement learning (RL). Hindsight Experience Replay (HER) attempts to address this issue by converting a failed experience to a successful one by relabeling the goals. Despite its effectiveness, HER has limited applicability because it lacks a compact and universal goal representation. We present Augmenting experienCe via TeacheR's adviCE (ACTRCE), an efficient reinforcement learning technique that extends the HER framework using natural language as the goal representation. We first analyze the differences among goal representations, and show that ACTRCE can efficiently solve difficult reinforcement learning problems in challenging 3D navigation tasks, whereas HER with a non-language goal representation fails to learn. We also show that with language goal representations, the agent can generalize to unseen instructions, and even generalize to instructions with unseen lexicons. We further demonstrate that it is crucial to use hindsight advice to solve challenging tasks, and that even a small amount of advice is sufficient for the agent to achieve good performance.
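The goal relabeling described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Transition` fields, the `relabel_with_advice` helper, the reward values, and the example instructions are all hypothetical, and the teacher's advice is assumed to be a sentence describing what the agent actually achieved at the end of a failed episode.

```python
from typing import List, NamedTuple


class Transition(NamedTuple):
    state: str
    action: str
    goal: str      # language instruction the agent was following (assumed field)
    reward: float


def relabel_with_advice(episode: List[Transition], advice: str,
                        success_reward: float = 1.0) -> List[Transition]:
    """Copy an episode, replacing its goal with the teacher's hindsight advice.

    Assumption: the advice describes what the final step achieved, so the
    last transition is treated as a success under the new goal.
    """
    if not episode:
        return []
    relabeled = [t._replace(goal=advice, reward=0.0) for t in episode]
    relabeled[-1] = relabeled[-1]._replace(reward=success_reward)
    return relabeled


# A failed episode under the original instruction (hypothetical data).
failed = [
    Transition("s0", "forward", "reach the red torch", 0.0),
    Transition("s1", "turn_left", "reach the red torch", 0.0),
]
# The teacher describes what actually happened; both versions can be stored for replay.
augmented = relabel_with_advice(failed, "reach the blue pillar")
replay_buffer = failed + augmented
```

Storing both the original and the relabeled episode means the agent still sees the failure under the intended instruction while also receiving a successful example for the instruction it happened to satisfy.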
Deep Reinforcement Learning from Policy-Dependent Human Feedback
Arumugam, Dilip, Lee, Jun Ki, Saskin, Sophie, Littman, Michael L.
To widen their accessibility and increase their utility, intelligent agents must be able to learn complex behaviors as specified by (non-expert) human users. Moreover, they will need to learn these behaviors within a reasonable amount of time while efficiently leveraging the sparse feedback a human trainer is capable of providing. Recent work has shown that human feedback can be characterized as a critique of an agent's current behavior rather than as an alternative reward signal to be maximized, culminating in the COnvergent Actor-Critic by Humans (COACH) algorithm for making direct policy updates based on human feedback. Our work builds on COACH, moving to a setting where the agent's policy is represented by a deep neural network. We employ a series of modifications on top of the original COACH algorithm that are critical for successfully learning behaviors from high-dimensional observations, while also satisfying the constraint of obtaining reduced sample complexity. We demonstrate the effectiveness of our Deep COACH algorithm in the rich 3D world of Minecraft with an agent that learns to complete tasks by mapping from raw pixels to actions using only real-time human feedback in 10-15 minutes of interaction.
Emergence of Hierarchy via Reinforcement Learning Using a Multiple Timescale Stochastic RNN
Han, Dongqi, Doya, Kenji, Tani, Jun
Although recurrent neural networks (RNNs) for reinforcement learning (RL) have demonstrated unique advantages in various aspects, e.g., solving memory-dependent tasks and meta-learning, very few studies have demonstrated how RNNs can solve the problem of hierarchical RL by autonomously developing hierarchical control. In this paper, we propose a novel model-free RL framework called ReMASTER, which combines an off-policy actor-critic algorithm with a multiple timescale stochastic recurrent neural network for solving memory-dependent and hierarchical tasks. We performed experiments using a challenging continuous control task and showed that: (1) The internal representation necessary for achieving hierarchical control develops autonomously through exploratory learning. (2) Stochastic neurons in RNNs enable faster relearning when adapting to a new task that is a recomposition of previously learned sub-goals.
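To make "multiple timescale" concrete, the sketch below shows a generic leaky-integrator recurrent update in which each unit has its own time constant. It is an assumption-laden illustration, not the ReMASTER architecture: the function name `multi_timescale_step`, the weight layout, and the additive Gaussian noise standing in for stochastic neurons are all hypothetical.

```python
import math
import random
from typing import List, Optional


def multi_timescale_step(u: List[float], x: List[float], taus: List[float],
                         w_rec: List[List[float]], w_in: List[List[float]],
                         noise_std: float = 0.0,
                         rng: Optional[random.Random] = None) -> List[float]:
    """One leaky-integrator update; unit i blends its old state with new drive at rate 1/taus[i]."""
    rng = rng or random.Random(0)
    y_prev = [math.tanh(v) for v in u]  # previous unit activations
    new_u = []
    for i, tau in enumerate(taus):
        drive = sum(w_rec[i][j] * y_prev[j] for j in range(len(u)))
        drive += sum(w_in[i][k] * x[k] for k in range(len(x)))
        if noise_std > 0:
            # Assumed form of stochasticity: additive Gaussian noise on the drive.
            drive += rng.gauss(0.0, noise_std)
        new_u.append((1.0 - 1.0 / tau) * u[i] + drive / tau)
    return new_u


# Two units receive the same input; the small-tau unit reacts faster.
state = multi_timescale_step(
    u=[0.0, 0.0], x=[1.0], taus=[2.0, 8.0],
    w_rec=[[0.0, 0.0], [0.0, 0.0]], w_in=[[1.0], [1.0]],
)
```

In this kind of update, units with a large time constant change slowly and can hold longer-horizon context, while units with a small time constant track fast changes in the input.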
Podcast #297: Reinforcement Learning with AWS DeepRacer Amazon Web Services
How are ML Models Trained? How can developers learn different approaches to solving business problems? Todd Escalona (Solutions Architect Evangelist, AWS) joins Simon to dive into reinforcement learning and AWS DeepRacer! The AWS Podcast is a cloud platform podcast for developers, dev ops, and cloud professionals seeking the latest news and trends in storage, security, infrastructure, serverless, and more. Join Simon Elisha and Jeff Barr for regular updates, deep dives and interviews.
Reinforcement Learning: Coming to a Home Called Yours!
I loved playing StarCraft, though I seldom played against other humans (my sons in particular, because they absolutely kick my butt). But ah, there is finally revenge for "Dad the Data Nerd", and it's known as AlphaStar. AlphaStar was developed by Google's DeepMind AI group to leverage artificial intelligence (AI) to master the game of StarCraft. StarCraft is much trickier for AI to master than games like Go and Mario Bros because of its unbounded complexity, continuously-changing gameplay (rather than the distinct events which occur when players take turns), evolving battlefield situations and dependency on constantly tweaking one's in-game strategy. I want to spend the rest of this blog doing a deep dive on Reinforcement Learning, because to me it is the trial-and-error nature of learning that places Reinforcement Learning squarely in the heart of future Artificial Intelligence aspirations.
Constraint Satisfaction Propagation: Non-stationary Policy Synthesis for Temporal Logic Planning
Ringstrom, Thomas J., Schrater, Paul R.
... capture dependencies between sequential time-constrained goal states because the state-space must be prohibitively expanded to accommodate a history of successfully achieved sub-goals. Also, policies and value functions derived with stationarity assumptions are not readily decomposable, leading to a tension between reward maximization and task generalization. We demonstrate a logic-compatible approach using model-based knowledge of environment dynamics and deadline information to directly infer non-stationary policies composed of reusable stationary policies. The policies are constructed to maximize the probability of satisfying time-sensitive goals while respecting time-varying obstacles. Our approach explicitly maintains two different spaces, a high-level logical task specification where the task-variables are grounded onto the low-level state-space of ...
The detective will need to reason about the order in which these sub-goals are executed and may need to use knowledge of individual deadlines to put constraints on the possible sub-goal sequences. For example, the detective knows that two key witnesses will be leaving town for work in the morning and the two main suspects will likely leave town later in the day. The detective will thus conclude that the witnesses must be questioned first so that there is enough time and evidence to arrest and interrogate the suspects, as they cannot be held in custody for longer than a day. The order in which the two witnesses are questioned and the order in which the two suspects are arrested does not matter for the satisfaction of the task which only requires that all sub-goals are met before their individual deadlines, leading to four distinct possible sequences of sub-goals that can be executed. Furthermore, the difficulty of this task is compounded by the fact that the detective must have knowledge of the underlying movement constraints and knowledge of the dynamics of the environment.
Preferences Implicit in the State of the World
Shah, Rohin, Krasheninnikov, Dmitrii, Alexander, Jordan, Abbeel, Pieter, Dragan, Anca
Reinforcement learning (RL) agents optimize only the features specified in a reward function and are indifferent to anything left out inadvertently. This means that we must not only specify what to do, but also the much larger space of what not to do. It is easy to forget these preferences, since they are already satisfied in our environment. This motivates our key insight: when a robot is deployed in an environment that humans act in, the state of the environment is already optimized for what humans want. We can therefore use this implicit preference information from the state to fill in the blanks. We develop an algorithm based on Maximum Causal Entropy IRL and use it to evaluate the idea in a suite of proof-of-concept environments designed to show its properties. We find that information from the initial state can be used to infer both side effects that should be avoided and preferences for how the environment should be organized.
Performance Dynamics and Termination Errors in Reinforcement Learning: A Unifying Perspective
Kuang, Nikki Lijing, Leung, Clement H. C.
In reinforcement learning, a decision needs to be made at some point as to whether it is worthwhile to carry on with the learning process or to terminate it. In many such situations, stochastic elements govern the occurrence of rewards, with the sequential occurrences of positive rewards randomly interleaved with negative rewards. For most practical learners, the learning is considered useful if the number of positive rewards always exceeds the number of negative ones. A situation that often calls for learning termination is when the number of negative rewards exceeds the number of positive rewards. However, while this seems reasonable, the error of premature termination, whereby learning is terminated and declared a failure even though the positive rewards would eventually far outnumber the negative ones, can be significant. In this paper, using combinatorial analysis, we study the probability of wrongly terminating a reinforcement learning activity, which undermines the effectiveness of an optimal policy, and we show that the resultant error can be quite high. Whilst we demonstrate mathematically that such errors can never be eliminated, we propose some practical mechanisms that can effectively reduce them. Simulation experiments have been carried out, the results of which are in close agreement with our theoretical findings.
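The termination rule in this abstract (stop as soon as negative rewards outnumber positive ones) can be explored with a small simulation. This is only a sketch under stated assumptions, not the authors' combinatorial analysis: rewards are assumed to be independent, each positive with a fixed probability `p_positive`, and the function names are hypothetical. Under that assumption the classical gambler's-ruin result gives an infinite-horizon termination probability of (1 - p) / p when p > 0.5.

```python
import random


def premature_termination_rate(p_positive: float, horizon: int,
                               trials: int, seed: int = 0) -> float:
    """Estimate how often negatives ever outnumber positives within `horizon` rewards."""
    rng = random.Random(seed)
    terminated = 0
    for _ in range(trials):
        balance = 0  # positive rewards minus negative rewards so far
        for _ in range(horizon):
            balance += 1 if rng.random() < p_positive else -1
            if balance < 0:  # negatives exceed positives: the rule terminates
                terminated += 1
                break
    return terminated / trials


def ruin_probability(p_positive: float) -> float:
    """Infinite-horizon probability that a biased +1/-1 walk ever drops below zero."""
    if not 0.0 < p_positive <= 1.0:
        raise ValueError("p_positive must be in (0, 1]")
    return min(1.0, (1.0 - p_positive) / p_positive)


estimate = premature_termination_rate(p_positive=0.6, horizon=200, trials=5000)
exact = ruin_probability(0.6)
print(f"simulated: {estimate:.3f}, closed form: {exact:.3f}")
```

Even when positive rewards are more likely than negative ones, this rule terminates a large fraction of runs in which positives would eventually dominate, which is in line with the abstract's claim that the error can be quite high.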
Stochastic Reinforcement Learning
Kuang, Nikki Lijing, Leung, Clement H. C., Sung, Vienne W. K.
In reinforcement learning episodes, the rewards and punishments are often non-deterministic, and there are invariably stochastic elements governing the underlying situation. Such stochastic elements are often numerous and cannot be known in advance, and they tend to obscure the underlying reward and punishment patterns. Indeed, if stochastic elements were absent, the same outcome would occur every time and the learning problems involved could be greatly simplified. In addition, in most practical situations, the cost of an observation to receive either a reward or a punishment can be significant, and one would wish to arrive at the correct learning conclusion while incurring minimum cost. In this paper, we present a stochastic approach to reinforcement learning which explicitly models the variability present in the learning environment and the cost of observation. Criteria and rules for learning success are quantitatively analyzed, and probabilities of exceeding the observation cost bounds are also obtained.
WiseMove: A Framework for Safe Deep Reinforcement Learning for Autonomous Driving
Lee, Jaeyoung, Balakrishnan, Aravind, Gaurav, Ashish, Czarnecki, Krzysztof, Sedwards, Sean
Machine learning can provide efficient solutions to the complex problems encountered in autonomous driving, but ensuring their safety remains a challenge. A number of authors have attempted to address this issue, but there are few publicly available tools to adequately explore the trade-offs between functionality, scalability, and safety. We thus present WiseMove, a software framework to investigate safe deep reinforcement learning in the context of motion planning for autonomous driving. WiseMove adopts a modular learning architecture that suits our current research questions and can be adapted to new technologies and new questions. We present the details of WiseMove, demonstrate its use on a common traffic scenario, and describe how we use it in our ongoing safe learning research.