Reinforcement Learning
Amazon Wants You to Code the AI Brain for This Little Car
Two years ago, Alphabet researchers made computing history when their artificial intelligence software AlphaGo defeated a world champion at the complex board game Go. Amazon now hopes to democratize the AI technique behind that milestone--with a pint-size self-driving car. The 1/18th-scale vehicle is called DeepRacer, and it can be preordered for $249; it will later cost $399. It's designed to make it easier for programmers to get started with reinforcement learning, the technique that powered AlphaGo's victory and is loosely inspired by how animals learn from feedback on their behavior. Although the approach has produced notable research stunts, such as bots that can play Go, chess, and complicated multiplayer electronic games, it isn't as widely used as the pattern-matching learning techniques used in speech recognition and image analysis.
A Structure-aware Online Learning Algorithm for Markov Decision Processes
Roy, Arghyadip, Borkar, Vivek, Karandikar, Abhay, Chaporkar, Prasanna
To overcome the curse of dimensionality and curse of modeling in Dynamic Programming (DP) methods for solving classical Markov Decision Process (MDP) problems, Reinforcement Learning (RL) algorithms are popular. In this paper, we consider an infinite-horizon average reward MDP problem and prove the optimality of the threshold policy under certain conditions. Traditional RL techniques do not exploit the threshold nature of optimal policy while learning. In this paper, we propose a new RL algorithm which utilizes the known threshold structure of the optimal policy while learning by reducing the feasible policy space. We establish that the proposed algorithm converges to the optimal policy. It provides a significant improvement in convergence speed and computational and storage complexity over traditional RL algorithms. The proposed technique can be applied to a wide variety of optimization problems that include energy efficient data transmission and management of queues. We exhibit the improvement in convergence speed of the proposed algorithm over other RL algorithms through simulations.
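As a rough illustration of why a known threshold structure shrinks the feasible policy space, the sketch below assumes a one-dimensional state (for example, a queue length) and two actions. The function names and the convention that action 1 is taken at or above the threshold are assumptions made for illustration; they are not taken from the paper.

```python
def threshold_policy(state, threshold):
    """Return action 1 when state >= threshold, else action 0 (assumed convention)."""
    return 1 if state >= threshold else 0


def count_policies(num_states):
    """Compare the sizes of the unrestricted and threshold-only policy spaces."""
    all_deterministic = 2 ** num_states  # any 0/1 choice in every state
    threshold_only = num_states + 1      # one policy per threshold in 0..num_states
    return all_deterministic, threshold_only


if __name__ == "__main__":
    # Actions chosen in states 0..5 with a hypothetical threshold of 3.
    print([threshold_policy(s, 3) for s in range(6)])
    # Policy-space sizes for a hypothetical 10-state problem.
    print(count_policies(10))
```

Under these assumptions, a learner only has to search over a single integer threshold rather than over one action per state.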
Understanding the impact of entropy on policy optimization
Ahmed, Zafarali, Roux, Nicolas Le, Norouzi, Mohammad, Schuurmans, Dale
Entropy regularization is commonly used to improve policy optimization in reinforcement learning. It is believed to help with exploration by encouraging the selection of more stochastic policies. In this work, we analyze this claim and, through new visualizations of the optimization landscape, we observe that incorporating entropy in policy optimization serves as a regularizer. We show that even with access to the exact gradient, policy optimization is difficult due to the geometry of the objective function. We qualitatively show that, in some environments, entropy regularization can make the optimization landscape smoother, thereby connecting local optima and enabling the use of larger learning rates. This manuscript presents new tools for understanding the underlying optimization landscape and highlights the challenge of designing general-purpose policy optimization algorithms in reinforcement learning.
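To make the entropy-regularized objective concrete, here is a minimal sketch for a single-state (bandit) problem with a softmax policy. The objective form, expected reward plus a coefficient `tau` times the policy entropy, is the common textbook formulation; the function names and the bandit setting are assumptions for illustration and are not the authors' experimental setup.

```python
import math


def softmax(logits):
    """Convert logits to action probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy of a discrete distribution, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def regularized_objective(logits, rewards, tau):
    """Expected reward under the softmax policy plus tau times its entropy."""
    probs = softmax(logits)
    expected_reward = sum(p * r for p, r in zip(probs, rewards))
    return expected_reward + tau * entropy(probs)
```

With `tau = 0` this reduces to the plain expected reward; a larger `tau` rewards more stochastic policies.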
Experience Replay for Continual Learning
Rolnick, David, Ahuja, Arun, Schwarz, Jonathan, Lillicrap, Timothy P., Wayne, Greg
Continual learning is the problem of learning new tasks or knowledge while protecting old knowledge and ideally generalizing from old experience to learn new tasks faster. Neural networks trained by stochastic gradient descent often degrade on old tasks when trained successively on new tasks with different data distributions. This phenomenon, referred to as catastrophic forgetting, is considered a major hurdle to learning with non-stationary data or sequences of new tasks, and prevents networks from continually accumulating knowledge and skills. We examine this issue in the context of reinforcement learning, in a setting where an agent is exposed to tasks in a sequence. Unlike most other work, we do not provide an explicit indication to the model of task boundaries, which is the most general circumstance for a learning agent exposed to continuous experience. While various methods to counteract catastrophic forgetting have recently been proposed, we explore a straightforward, general, and seemingly overlooked solution - that of using experience replay buffers for all past events - with a mixture of on- and off-policy learning, leveraging behavioral cloning. We show that this strategy can still learn new tasks quickly yet can substantially reduce catastrophic forgetting in both Atari and DMLab domains, even matching the performance of methods that require task identities. When buffer storage is constrained, we confirm that a simple mechanism for randomly discarding data allows a limited size buffer to perform almost as well as an unbounded one.
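One simple way to discard data at random while keeping a bounded buffer is reservoir sampling, sketched below. The abstract does not say which mechanism the authors use, so the choice of reservoir sampling, the class name, and its methods are assumptions made for illustration.

```python
import random


class ReservoirBuffer:
    """Fixed-capacity buffer that keeps a uniform random subset of all items seen."""

    def __init__(self, capacity, seed=None):
        self.capacity = capacity
        self.items = []
        self.seen = 0  # total number of items offered so far
        self.rng = random.Random(seed)

    def add(self, item):
        """Store the item, possibly overwriting a randomly chosen older one."""
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)
        else:
            # Keep the new item with probability capacity / seen.
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = item

    def sample(self, batch_size):
        """Draw a batch without replacement from the stored items."""
        k = min(batch_size, len(self.items))
        return self.rng.sample(self.items, k)
```

With this scheme every experience seen so far has the same probability of still being in the buffer, so old tasks are not systematically evicted by newer ones.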
Efficiently Combining Human Demonstrations and Interventions for Safe Training of Autonomous Systems in Real-Time
Goecks, Vinicius G., Gremillion, Gregory M., Lawhern, Vernon J., Valasek, John, Waytowich, Nicholas R.
This paper investigates how to utilize different forms of human interaction to safely train autonomous systems in real-time by learning from both human demonstrations and interventions. We implement two components of the Cycle-of-Learning for Autonomous Systems, which is our framework for combining multiple modalities of human interaction. The current effort employs human demonstrations to teach a desired behavior via imitation learning, then leverages intervention data to correct for undesired behaviors produced by the imitation learner to teach novel tasks to an autonomous agent safely, after only minutes of training. We demonstrate this method in an autonomous perching task using a quadrotor with continuous roll, pitch, yaw, and throttle commands and imagery captured from a downward-facing camera in a high-fidelity simulated environment. Our method improves task completion performance for the same amount of human interaction when compared to learning from demonstrations alone, while also requiring on average 32% less data to achieve that performance. This provides evidence that combining multiple modes of human interaction can increase both the training speed and overall performance of policies for autonomous systems.
Target Driven Visual Navigation with Hybrid Asynchronous Universal Successor Representations
Siriwardhana, Shamane, Weerasekera, Rivindu, Nanayakkara, Suranga
Being able to navigate to a target with minimal supervision and prior knowledge is critical to creating human-like assistive agents. Prior work on map-based and map-less approaches has limited generalizability. In this paper, we present a novel approach, Hybrid Asynchronous Universal Successor Representations (HAUSR), which overcomes the problem of generalizability to new goals by adapting recent work on Universal Successor Representations with Asynchronous Actor-Critic Agents. We show that the agent was able to successfully reach novel goals and we were able to quickly fine-tune the network for adapting to new scenes. This opens up novel application scenarios where intelligent agents could learn from and adapt to a wide range of environments with minimal human input.
Grammars and reinforcement learning for molecule optimization
An important challenge in drug discovery is to find molecules with desired chemical properties. While ultimate usefulness as a drug can only be determined in a laboratory or clinical context, that process is expensive, and it is thus advantageous to pre-select likely candidates in software. While deep learning has been extensively investigated for molecular graph encoding ([Duvenaud et al., 2015], [Kearnes et al., 2016], [Gilmer et al., 2017]), molecule generation is still a subject of active research. The simplest natural approach to candidate molecule generation is to generate some sort of a linear representation, such as a string of characters in the SMILES format [Weininger, 1988], using an encoder-decoder network architecture similar to that used in machine translation, as done in [Gómez-Bombarelli et al., 2016]. This approach's performance was comparatively poor because a molecule's structure is not linear, but rather a graph which typically includes cycles, so it falls to the model to learn how to generate SMILES strings that correspond to chemically valid molecules - a nontrivial task that leaves the model with little spare capacity to additionally optimize a given chemical metric of the molecules produced. A way to partially remedy that involves generating not the actual SMILES strings, but a sequence of production rules of a context-free grammar (CFG) for SMILES, as done by [Kusner et al., 2017]. That guarantees that the SMILES strings produced are grammatically valid, putting less burden on the model to ensure validity and thereby achieving better metrics. However, [Kusner et al., 2017] give two reasons why this is still not guaranteed to produce chemically valid molecules: firstly, a grammatically valid SMILES string is not guaranteed to be chemically possible (because of atom valences being wrong, for example), and secondly, because a
Unsupervised Control Through Non-Parametric Discriminative Rewards
Warde-Farley, David, Van de Wiele, Tom, Kulkarni, Tejas, Ionescu, Catalin, Hansen, Steven, Mnih, Volodymyr
Learning to control an environment without hand-crafted rewards or expert data remains challenging and is at the frontier of reinforcement learning research. We present an unsupervised learning algorithm to train agents to achieve perceptually-specified goals using only a stream of observations and actions. Our agent simultaneously learns a goal-conditioned policy and a goal achievement reward function that measures how similar a state is to the goal state. This dual optimization leads to a co-operative game, giving rise to a learned reward function that reflects similarity in controllable aspects of the environment instead of distance in the space of observations. We demonstrate the efficacy of our agent to learn, in an unsupervised manner, to reach a diverse set of goals on three domains -- Atari, the DeepMind Control Suite and DeepMind Lab.
Prioritizing Starting States for Reinforcement Learning
Tavakoli, Arash, Levdik, Vitaly, Islam, Riashat, Kormushev, Petar
Online, off-policy reinforcement learning algorithms are able to use an experience memory to remember and replay past experiences. In prior work, this approach was used to stabilize training by breaking the temporal correlations of the updates and avoiding the rapid forgetting of possibly rare experiences. In this work, we propose a conceptually simple framework that uses an experience memory to help exploration by prioritizing the starting states from which the agent starts acting in the environment, importantly, in a fashion that is also compatible with on-policy algorithms. Given the capacity to restart the agent in states corresponding to its past observations, we achieve this objective by (i) enabling the agent to restart in states belonging to significant past experiences (e.g., nearby goals), and (ii) promoting faster coverage of the state space through starting from a more diverse set of states. Given a good measure of priority to identify significant past transitions, we expect case (i) to help exploration more in certain problems (e.g., sparse reward tasks), while we hypothesize that case (ii) will generally be beneficial, even without any prioritization. We show empirically that our approach improves learning performance for both off-policy and on-policy deep reinforcement learning methods, with the most notable improvement in a significantly sparse reward task.
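The two restart modes described above can be sketched as a single selection function: uniform sampling over remembered states for case (ii), and priority-weighted sampling for case (i). The function name, the list-based memory, and the use of raw priorities as sampling weights are assumptions for illustration; the abstract does not specify how priorities are computed or applied.

```python
import random


def choose_start_state(memory, priorities=None, rng=random):
    """Pick a past state to restart from.

    memory: list of previously observed states (hypothetical representation).
    priorities: optional non-negative weights, one per state; when omitted,
        states are sampled uniformly, which only diversifies starting states.
    rng: the random module or a random.Random instance.
    """
    if not memory:
        raise ValueError("memory is empty")
    if priorities is None:
        return rng.choice(memory)
    # Sample in proportion to the given priorities.
    return rng.choices(memory, weights=priorities, k=1)[0]
```

A training loop would call this at the start of each episode, assuming the environment can be reset to an arbitrary stored state.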
Scaling Configuration of Energy Harvesting Sensors with Reinforcement Learning
Fraternali, Francesco, Balaji, Bharathan, Gupta, Rajesh
With the advent of the Internet of Things (IoT), an increasing number of energy harvesting methods are being used to supplement or supplant battery based sensors. Energy harvesting sensors need to be configured according to the application, hardware, and environmental conditions to maximize their usefulness. As of today, the configuration of sensors is either manual or heuristics based, requiring valuable domain expertise. Reinforcement learning (RL) is a promising approach to automate configuration and efficiently scale IoT deployments, but it is not yet adopted in practice. We propose solutions to bridge this gap: reduce the training phase of RL so that nodes are operational within a short time after deployment and reduce the computational requirements to scale to large deployments. We focus on configuration of the sampling rate of indoor solar panel based energy harvesting sensors. We created a simulator based on 3 months of data collected from 5 sensor nodes subject to different lighting conditions. Our simulation results show that RL can effectively learn energy availability patterns and configure the sampling rate of the sensor nodes to maximize the sensing data while ensuring that energy storage is not depleted. The nodes can be operational within the first day by using our methods. We show that it is possible to reduce the number of RL policies by using a single policy for nodes that share similar lighting conditions.