Reinforcement Learning
Policy Search with Rare Significant Events: Choosing the Right Partner to Cooperate with
Ecoffet, Paul, Fontbonne, Nicolas, André, Jean-Baptiste, Bredeche, Nicolas
This paper focuses on a class of reinforcement learning problems where significant events are rare and limited to a single positive reward per episode. A typical example is that of an agent who has to choose a partner to cooperate with, while a large number of partners are simply not interested in cooperating, regardless of what the agent has to offer. We address this problem in a continuous state and action space with two different kinds of search methods: a gradient policy search method and a direct policy search method using an evolution strategy. We show that when significant events are rare, gradient information is also scarce, making it difficult for policy gradient search methods to find an optimal policy, with or without a deep neural architecture. On the other hand, we show that direct policy search methods are invariant to the rarity of significant events, which is yet another confirmation of the unique role evolutionary algorithms have to play as a reinforcement learning method.
Adapting User Interfaces with Model-based Reinforcement Learning
Todi, Kashyap, Bailly, Gilles, Leiva, Luis A., Oulasvirta, Antti
Adapting an interface requires taking into account both the positive and negative effects that changes may have on the user. A carelessly picked adaptation may impose high costs on the user -- for example, due to surprise or relearning effort -- or "trap" the process in a suboptimal design prematurely. However, effects on users are hard to predict as they depend on factors that are latent and evolve over the course of interaction. We propose a novel approach for adaptive user interfaces that yields a conservative adaptation policy: it finds beneficial changes when they exist and avoids changes when there are none. Our model-based reinforcement learning method plans sequences of adaptations and consults predictive HCI models to estimate their effects. We present empirical and simulation results from the case of adaptive menus, showing that the method outperforms both a non-adaptive and a frequency-based policy.
Auto-COP: Adaptation Generation in Context-Oriented Programming using Reinforcement Learning Options
Cardozo, Nicolás, Dusparic, Ivana
Self-adaptive software systems continuously adapt in response to internal and external changes in their execution environment, captured as contexts. The Context-Oriented Programming (COP) paradigm posits a technique for the development of self-adaptive systems, capturing their main characteristics with specialized programming language constructs. COP adaptations are specified as independent modules composed in and out of the base system as contexts are activated and deactivated in response to sensed circumstances from the surrounding environment. However, the definition of adaptations, together with their contexts and associated specialized behavior, needs to be specified at design time. In complex cyber-physical systems (CPS) this is intractable due to new, unpredicted operating conditions. We propose Auto-COP, a new technique to enable generation of adaptations at run time. Auto-COP uses reinforcement learning (RL) options to build action sequences, based on the previous instances of the system execution. Options are explored in interaction with the environment, and the most suitable options for each context are used to generate adaptations exploiting COP. To validate Auto-COP, we present two case studies exhibiting different system characteristics and application domains: a driving assistant and a robot delivery system. We present examples of Auto-COP code generated at run time, to illustrate the types of circumstances (contexts) requiring adaptation, and the corresponding generated adaptations for each context. We confirm that the generated adaptations exhibit correct system behavior, measured by domain-specific performance metrics, while reducing the number of required execution/actuation steps by a factor of two. This shows that the adaptations are regularly selected by the running system, as adaptive behavior is more appropriate than the execution of primitive actions.
A Quadratic Actor Network for Model-Free Reinforcement Learning
Weissenbacher, Matthias, Kawahara, Yoshinobu
In this work we discuss the incorporation of quadratic neurons into policy networks in the context of model-free actor-critic reinforcement learning. Quadratic neurons admit an explicit quadratic function approximation, in contrast to conventional approaches where the non-linearity is induced by the activation functions. We perform empirical experiments on several MuJoCo continuous control tasks and find that when quadratic neurons are added to MLP policy networks, the resulting networks outperform the baseline MLP while using fewer parameters. On average, the top returned reward increases by $5.8\%$, while the approach is about $21\%$ more sample efficient. Moreover, the quadratic policy network maintains its advantage under added action and observation noise.
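As a rough illustration of what "explicit quadratic function approximation" can mean, the sketch below evaluates a single neuron of the form y = x^T W x + w^T x + b. The abstract does not give the parameterisation used in the paper, so this form, the function name `quadratic_neuron`, and all numeric values are assumptions made for illustration only.

```python
from typing import List


def quadratic_neuron(x: List[float], W: List[List[float]],
                     w: List[float], b: float) -> float:
    """Evaluate y = x^T W x + w^T x + b (an assumed, generic quadratic form)."""
    n = len(x)
    # Quadratic term: sum over all pairs (i, j) of x_i * W_ij * x_j.
    quad = sum(x[i] * W[i][j] * x[j] for i in range(n) for j in range(n))
    # Linear term: the same computation a conventional neuron performs.
    lin = sum(w[i] * x[i] for i in range(n))
    return quad + lin + b


# Hypothetical two-dimensional input and parameters.
x = [1.0, 2.0]
W = [[1.0, 0.0], [0.0, 0.5]]
w = [0.5, -1.0]
b = 0.25
y = quadratic_neuron(x, W, w, b)
print(y)
```

Unlike a conventional neuron, the second-order dependence on the input here comes from the weights `W` rather than from an activation function applied afterwards.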
Symbolic Reinforcement Learning for Safe RAN Control
Nikou, Alexandros, Mujumdar, Anusha, Orlic, Marin, Feljan, Aneta Vulgarakis
We demonstrate a Symbolic Reinforcement Learning (SRL) architecture for safe control in Radio Access Network (RAN) applications. In our automated tool, a user can select high-level safety specifications expressed in Linear Temporal Logic (LTL) to shield an RL agent running in a given cellular network with the aim of optimizing network performance, as measured through certain Key Performance Indicators (KPIs). In the proposed architecture, network safety shielding is ensured through model-checking techniques over combined discrete system models (automata) that are abstracted through reinforcement learning. We demonstrate the user interface (UI) helping the user set intent specifications to the architecture and inspect the difference in allowed and blocked actions. In order to express desired specifications to the network into consideration, LTL is used (see [2, 10, 12, 13]), due to the fact that it provides a powerful mathematical formalism for such purpose. Our proposed demonstration exhibits the following attributes: (1) a general automatic framework from LTL specification user input to the derivation of the policy that fulfills it; at the same time, blocking the control actions that violate the specification; (2) novel system dynamics abstraction to companions Markov Decision Processes (MDP) which is computationally efficient; (3) UI development that allows the user to graphically access all
Multi-Task Federated Reinforcement Learning with Adversaries
Anwar, Aqeel, Raychowdhury, Arijit
Reinforcement learning algorithms, just like any other machine learning algorithms, face a serious threat from adversaries. The adversaries can manipulate the learning algorithm, resulting in non-optimal policies. In this paper, we analyze Multi-task Federated Reinforcement Learning algorithms, where multiple collaborative agents in various environments are trying to maximize the sum of discounted return, in the presence of adversarial agents. We argue that the common attack methods are not guaranteed to carry out a successful attack on Multi-task Federated Reinforcement Learning and propose an adaptive attack method with better attack performance. Furthermore, we modify the conventional federated reinforcement learning algorithm to address the issue of adversaries, so that it works equally well with and without adversaries. Experimentation on different small to mid-size reinforcement learning problems shows that the proposed attack method outperforms other general attack methods and that the proposed modification to the federated reinforcement learning algorithm was able to achieve near-optimal policies in the presence of adversarial agents.
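For readers unfamiliar with the objective mentioned above, the following minimal sketch computes the discounted return of a single episode, G = r_0 + gamma * r_1 + gamma^2 * r_2 + ... It shows only the standard definition, not the paper's federated algorithm; the reward list and discount factor are made-up values.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Return sum_t gamma^t * rewards[t], accumulated from the last step backwards."""
    g = 0.0
    for r in reversed(rewards):
        # Each earlier step sees the later return discounted once more.
        g = r + gamma * g
    return g


# Hypothetical three-step episode with discount factor 0.5.
episode_rewards = [1.0, 0.0, 2.0]
g = discounted_return(episode_rewards, gamma=0.5)
print(g)  # 1.0 + 0.5 * 0.0 + 0.25 * 2.0 = 1.5
```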
Generalizable Episodic Memory for Deep Reinforcement Learning
Hu, Hao, Ye, Jianing, Ren, Zhizhou, Zhu, Guangxiang, Zhang, Chongjie
Episodic memory-based methods can rapidly latch onto past successful strategies by a non-parametric memory and improve sample efficiency of traditional reinforcement learning. However, little effort is put into the continuous domain, where a state is never visited twice and previous episodic methods fail to efficiently aggregate experience across trajectories. To address this problem, we propose Generalizable Episodic Memory (GEM), which effectively organizes the state-action values of episodic memory in a generalizable manner and supports implicit planning on memorized trajectories. GEM utilizes a double estimator to reduce the overestimation bias induced by value propagation in the planning process. Empirical evaluation shows that our method significantly outperforms existing trajectory-based methods on various MuJoCo continuous control tasks. To further show the general applicability, we evaluate our method on Atari games with discrete action space, which also shows significant improvement over baseline algorithms.
A Scavenger Hunt for Service Robots
Yedidsion, Harel, Suriadinata, Jennifer, Xu, Zifan, Debruyn, Stefan, Stone, Peter
Creating robots that can perform general-purpose service tasks in a human-populated environment has been a longstanding grand challenge for AI and Robotics research. One particularly valuable skill that is relevant to a wide variety of tasks is the ability to locate and retrieve objects upon request. This paper models this skill as a Scavenger Hunt (SH) game, which we formulate as a variation of the NP-hard stochastic traveling purchaser problem. In this problem, the goal is to find a set of objects as quickly as possible, given probability distributions of where they may be found. We investigate the performance of several solution algorithms for the SH problem, both in simulation and on a real mobile robot. We use Reinforcement Learning (RL) to train an agent to plan a minimal cost path, and show that the RL agent can outperform a range of heuristic algorithms, achieving near optimal performance. In order to stimulate research on this problem, we introduce a publicly available software stack and associated website that enable users to upload scavenger hunts which robots can download, perform, and learn from to continually improve their performance on future hunts.
Getting started with Reinforcement Learning
Artificial Intelligence (AI) has undergone impressive advancements in recent years. Thanks to Machine Learning, we have been able to achieve good competency at the Narrow AI level. Reinforcement Learning is now considered to be the most promising technique for moving to the next level in the AI paradigm (Figure 1). One of the reasons why Reinforcement Learning has gained so much interest today is its interdisciplinarity: the core concepts of this area follow basic principles from game theory, evolution, and neuroscience.
Induction and Exploitation of Subgoal Automata for Reinforcement Learning
Furelos-Blanco, Daniel (Imperial College London) | Law, Mark (Imperial College London) | Jonsson, Anders (Universitat Pompeu Fabra) | Broda, Krysia | Russo, Alessandra
In this paper we present ISA, an approach for learning and exploiting subgoals in episodic reinforcement learning (RL) tasks. ISA interleaves reinforcement learning with the induction of a subgoal automaton, an automaton whose edges are labeled by the task’s subgoals expressed as propositional logic formulas over a set of high-level events. A subgoal automaton also consists of two special states: a state indicating the successful completion of the task, and a state indicating that the task has finished without succeeding. A state-of-the-art inductive logic programming system is used to learn a subgoal automaton that covers the traces of high-level events observed by the RL agent. When the currently exploited automaton does not correctly recognize a trace, the automaton learner induces a new automaton that covers that trace. The interleaving process guarantees the induction of automata with the minimum number of states, and applies a symmetry breaking mechanism to shrink the search space whilst remaining complete. We evaluate ISA in several gridworld and continuous state space problems using different RL algorithms that leverage the automaton structures. We provide an in-depth empirical analysis of the automaton learning performance in terms of the traces, the symmetry breaking and specific restrictions imposed on the final learnable automaton. For each class of RL problem, we show that the learned automata can be successfully exploited to learn policies that reach the goal, achieving an average reward comparable to the case where automata are not learned but handcrafted and given beforehand.
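To make the automaton structure described above more concrete, here is a minimal sketch of a subgoal automaton whose edges are guarded by conjunctions of high-level events, with one accepting and one rejecting state. It is not ISA itself: the class name, the guard representation (required and forbidden event sets), the "stay in place if no edge fires" rule, and the coffee/office/decoration task are all hypothetical choices made for illustration.

```python
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple

# A guard is a conjunction: events that must be observed and events that must not be.
Guard = Tuple[FrozenSet[str], FrozenSet[str]]


class SubgoalAutomaton:
    def __init__(self, initial: str, accepting: str, rejecting: str) -> None:
        self.initial = initial
        self.accepting = accepting  # task completed successfully
        self.rejecting = rejecting  # task finished without succeeding
        self.edges: Dict[str, List[Tuple[Guard, str]]] = {}

    def add_edge(self, src: str, dst: str,
                 required: Iterable[str] = (),
                 forbidden: Iterable[str] = ()) -> None:
        guard = (frozenset(required), frozenset(forbidden))
        self.edges.setdefault(src, []).append((guard, dst))

    def step(self, state: str, observed: Set[str]) -> str:
        """Follow the first edge whose guard holds; otherwise stay (an assumption)."""
        for (required, forbidden), dst in self.edges.get(state, []):
            if required <= observed and not (forbidden & observed):
                return dst
        return state

    def run(self, trace: Iterable[Set[str]]) -> str:
        """Process a trace of observed event sets and return the final state."""
        state = self.initial
        for observed in trace:
            state = self.step(state, observed)
            if state in (self.accepting, self.rejecting):
                break
        return state


# Hypothetical task: pick up coffee, then reach the office, never touching a decoration.
automaton = SubgoalAutomaton(initial="u0", accepting="u_acc", rejecting="u_rej")
automaton.add_edge("u0", "u1", required=["coffee"], forbidden=["decoration"])
automaton.add_edge("u0", "u_rej", required=["decoration"])
automaton.add_edge("u1", "u_acc", required=["office"], forbidden=["decoration"])
automaton.add_edge("u1", "u_rej", required=["decoration"])

success = automaton.run([set(), {"coffee"}, set(), {"office"}])
failure = automaton.run([{"coffee"}, {"decoration"}, {"office"}])
print(success, failure)
```

In this sketch a trace is "recognized" when it ends in the accepting or rejecting state that matches the actual outcome of the episode; a mismatch is the kind of counterexample that, per the abstract, would trigger the induction of a new automaton.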