Reinforcement Learning
Automata Guided Reinforcement Learning With Demonstrations
Li, Xiao, Ma, Yao, Belta, Calin
Tasks with complex temporal structures and long horizons pose a challenge for reinforcement learning agents due to the difficulty in specifying the tasks in terms of reward functions as well as large variances in the learning signals. We propose to address these problems by combining temporal logic (TL) with reinforcement learning from demonstrations. Our method automatically generates intrinsic rewards that align with the overall task goal given a TL task specification. The policy resulting from our framework has an interpretable and hierarchical structure. We validate the proposed method experimentally on a set of robotic manipulation tasks.
Personalized Education at Scale
Saarinen, Sam, Cater, Evan, Littman, Michael
Tailoring the presentation of information to the needs of individual students leads to massive gains in student outcomes (Bloom 1984). This finding is likely due to the fact that different students learn differently, perhaps as a result of variation in ability, interest or other factors (Schiefele, Krapp, and Winteler 1992). Adapting presentations to the educational needs of an individual has traditionally been the domain of experts, making it expensive and logistically challenging to do at scale, and also leading to inequity in educational outcomes. Increased course sizes and large MOOC enrollments provide an unprecedented access to student data. We propose that emerging technologies in reinforcement learning (RL), as well as semi-supervised learning, natural language processing, and computer vision are critical to leveraging this data to provide personalized education at scale.
Low Precision Policy Distillation with Application to Low-Power, Real-time Sensation-Cognition-Action Loop with Neuromorphic Computing
Mckinstry, Jeffrey L, Barch, Davis R., Bablani, Deepika, Debole, Michael V., Esser, Steven K., Kusnitz, Jeffrey A., Arthur, John V., Modha, Dharmendra S.
Low precision networks in the reinforcement learning (RL) setting are relatively unexplored because of the limitations of binary activations for function approximation. Here, in the discrete action ATARI domain, we demonstrate, for the first time, that low precision policy distillation from a high precision network provides a principled, practical way to train an RL agent. As an application, on 10 different ATARI games, we demonstrate real-time end-to-end game playing on low-power neuromorphic hardware by converting a sequence of game frames into discrete actions.
EpiRL: A Reinforcement Learning Agent to Facilitate Epistasis Detection
Huang, Kexin, Nogueira, Rodrigo
Epistasis (gene-gene interaction) is crucial to predicting genetic disease. Our work tackles the computational challenges faced by previous works in epistasis detection by modeling it as a one-step Markov Decision Process where the state is genome data, the actions are the interacted genes, and the reward is an interaction measurement for the selected actions. A reinforcement learning agent using policy gradient method then learns to discover a set of highly interacted genes. The fundamental goal for studying genetics is to understand how certain genes can incur disease and traits. Since the advent of Genome-Wide Association Studies (GWAS) (Burton et al., 2007), thousands of SNP (Single Nucleotide Polymorphism)s have been identified and associated with genetic diseases and traits.
SDN Flow Entry Management Using Reinforcement Learning
Mu, Ting-Yu, Al-Fuqaha, Ala, Shuaib, Khaled, Sallabi, Farag M., Qadir, Junaid
Modern information technology services largely depend on cloud infrastructures to provide their services. These cloud infrastructures are built on top of datacenter networks (DCNs) constructed with high-speed links, fast switching gear, and redundancy to offer better flexibility and resiliency. In this environment, network traffic includes long-lived (elephant) and short-lived (mice) flows with partitioned and aggregated traffic patterns. Although SDN-based approaches can efficiently allocate networking resources for such flows, the overhead due to network reconfiguration can be significant. With limited capacity of Ternary Content-Addressable Memory (TCAM) deployed in an OpenFlow enabled switch, it is crucial to determine which forwarding rules should remain in the flow table, and which rules should be processed by the SDN controller in case of a table-miss on the SDN switch. This is needed in order to obtain the flow entries that satisfy the goal of reducing the long-term control plane overhead introduced between the controller and the switches. To achieve this goal, we propose a machine learning technique that utilizes two variations of reinforcement learning (RL) algorithms-the first of which is traditional reinforcement learning algorithm based while the other is deep reinforcement learning based. Emulation results using the RL algorithm show around 60% improvement in reducing the long-term control plane overhead, and around 14% improvement in the table-hit ratio compared to the Multiple Bloom Filters (MBF) method given a fixed size flow table of 4KB.
Resilient Computing with Reinforcement Learning on a Dynamical System: Case Study in Sorting
Faust, Aleksandra, Aimone, James B., James, Conrad D., Tapia, Lydia
In particular, reinforcement learning (RL) and feedback control can be used to help a robot achieve a goal. Taking advantage of this body of work, this paper formulates general computation as a feedback-control problem, which allows the agent to autonomously overcome some limitations of standard procedural language programming: resilience to errors and early program termination. Our formulation considers computation to be trajectory generation in the program's variable space. The computing then becomes a sequential decision making problem, solved with reinforcement learning (RL), and analyzed with Lyapunov stability theory to assess the agent's resilience and progression to the goal. We do this through a case study on a quintessential computer science problem, array sorting. Evaluations show that our RL sorting agent makes steady progress to an asymptotically stable goal, is resilient to faulty components, and performs less array manipulations than traditional Quicksort and Bubble sort.
Better Safe than Sorry: Evidence Accumulation Allows for Safe Reinforcement Learning
Agarwal, Akshat, Kumar, Abhinau V, Dunovan, Kyle, Peterson, Erik, Verstynen, Timothy, Sycara, Katia
In the real world, agents often have to operate in situations with incomplete information, limited sensing capabilities, and inherently stochastic environments, making individual observations incomplete and unreliable. Moreover, in many situations it is preferable to delay a decision rather than run the risk of making a bad decision. In such situations it is necessary to aggregate information before taking an action; however, most state of the art reinforcement learning (RL) algorithms are biased towards taking actions \textit{at every time step}, even if the agent is not particularly confident in its chosen action. This lack of caution can lead the agent to make critical mistakes, regardless of prior experience and acclimation to the environment. Motivated by theories of dynamic resolution of uncertainty during decision making in biological brains, we propose a simple accumulator module which accumulates evidence in favor of each possible decision, encodes uncertainty as a dynamic competition between actions, and acts on the environment only when it is sufficiently confident in the chosen action. The agent makes no decision by default, and the burden of proof to make a decision falls on the policy to accrue evidence strongly in favor of a single decision. Our results show that this accumulator module achieves near-optimal performance on a simple guessing game, far outperforming deep recurrent networks using traditional, forced action selection policies.
On Reinforcement Learning for Full-length Game of StarCraft
Pang, Zhen-Jia, Liu, Ruo-Ze, Meng, Zhou-Yu, Zhang, Yi, Yu, Yang, Lu, Tong
StarCraft II poses a grand challenge for reinforcement learning. The main difficulties of it include huge state and action space and a long-time horizon. In this paper, we investigate a hierarchical reinforcement learning approach for StarCraft II. The hierarchy involves two levels of abstraction. One is the macro-action automatically extracted from expert's trajectories, which reduces the action space in an order of magnitude yet remains effective. The other is a two-layer hierarchical architecture which is modular and easy to scale, enabling a curriculum transferring from simpler tasks to more complex tasks. The reinforcement training algorithm for this architecture is also investigated. On a 64x64 map and using restrictive units, we achieve a winning rate of more than 99\% against the difficulty level-1 built-in AI. Through the curriculum transfer learning algorithm and a mixture of combat model, we can achieve over 93\% winning rate of Protoss against the most difficult non-cheating built-in AI (level-7) of Terran, training within two days using a single machine with only 48 CPU cores and 8 K40 GPUs. It also shows strong generalization performance, when tested against never seen opponents including cheating levels built-in AI and all levels of Zerg and Protoss built-in AI. We hope this study could shed some light on the future research of large-scale reinforcement learning.
A Learning Framework for High Precision Industrial Assembly
Fan, Yongxiang, Luo, Jieliang, Tomizuka, Masayoshi
Automatic assembly has broad applications in industries. Traditional assembly tasks utilize predefined trajectories or tuned force control parameters, which make the automatic assembly time-consuming, difficult to generalize, and not robust to uncertainties. In this paper, we propose a learning framework for high precision industrial assembly. The framework combines both the supervised learning and the reinforcement learning. The supervised learning utilizes trajectory optimization to provide the initial guidance to the policy, while the reinforcement learning utilizes actor-critic algorithm to establish the evaluation system when the supervisor is not accurate. The proposed learning framework is more efficient compared with the reinforcement learning and achieves better stability performance than the supervised learning. The effectiveness of the method is verified by both the simulation and experiment. Experimental videos are available at~\cite{website}.
Move 37 Course – School of AI
Join me as I teach this free 10-week reinforcement learning course I've called Move 37. I'll take you on a journey through the basics up to modern day techniques. Every week, we'll build apps together that will cover both toy and industry problems. You'll be able to measure your progress along the way by chatting with your peers both online and offline at the School of AI chapters globally, taking quizzes, coding challenges, and 2 graded projects. I'll have weekly coding live streams to help answer any questions, and my assistant instructors will be available to help in our community slack channel. Students who successfully complete the course will receive their own certificate signed by Siraj Raval.