Reinforcement Learning
Neural SLAM: Learning to Explore with External Memory
Zhang, Jingwei, Tai, Lei, Liu, Ming, Boedecker, Joschka, Burgard, Wolfram
We present an approach for agents to learn representations of a global map from sensor data, to aid their exploration in new environments. To achieve this, we embed procedures mimicking those of traditional Simultaneous Localization and Mapping (SLAM) into the soft-attention-based addressing of external memory architectures, in which the external memory acts as an internal representation of the environment. This structure encourages the evolution of SLAM-like behaviors inside a completely differentiable deep neural network. We show that this approach can help reinforcement learning agents to successfully explore new environments where long-term memory is essential. We validate our approach in both challenging grid-world environments and preliminary Gazebo experiments. A video of our experiments can be found at: https://goo.gl/G2Vu5y.
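The soft-attention addressing mentioned in the abstract above can be illustrated with a minimal sketch of content-based reads and writes on an external memory. This is a generic formulation in the style of memory-augmented networks, not the authors' architecture; the function names, the sharpness parameter `beta`, and the toy memory are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity with a small epsilon so all-zero rows are safe."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def address(memory, key, beta=5.0):
    """Soft attention weights: softmax over beta-scaled key/row similarity."""
    scores = [beta * cosine(row, key) for row in memory]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def read(memory, weights):
    """Read vector: attention-weighted sum of the memory rows."""
    width = len(memory[0])
    return [sum(w * row[j] for w, row in zip(weights, memory))
            for j in range(width)]


def write(memory, weights, erase, add):
    """Soft write: each row is partly erased, then has new content added,
    in proportion to its attention weight. Returns a new memory."""
    return [[cell * (1.0 - w * e) + w * a
             for cell, e, a in zip(row, erase, add)]
            for w, row in zip(weights, memory)]


# Hypothetical toy memory: three slots, each a 2-dimensional feature vector.
memory = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
weights = address(memory, key=[1.0, 0.0])
read_vector = read(memory, weights)
memory = write(memory, weights, erase=[1.0, 1.0], add=[0.5, 0.5])
```

Because every step is a smooth function of the key and the memory contents, gradients can flow through both reads and writes, which is the property that lets such a memory sit inside a fully differentiable network.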
Deep Reinforcement Learning: A State-of-the-Art Walkthrough
Lazaridis, Aristotelis | Fachantidis, Anestis (Postdoctoral Researcher / Co-Founder, CEO of Medoid AI) | Vlahavas, Ioannis (Professor, School of Informatics, Aristotle University of Thessaloniki, Greece)
Deep Reinforcement Learning is a topic that has gained a lot of attention recently, due to the unprecedented achievements and remarkable performance of such algorithms in various benchmark tests and environmental setups. The power of such methods comes from combining the established and strong field of Deep Learning with the unique nature of Reinforcement Learning methods. However, a compact, accurate and comparable view of these methods and their results is needed to gain valuable technical and practical insights. In this work we gather the essential methods related to Deep Reinforcement Learning, extracting common property structures for three complementary core categories: a) Model-Free, b) Model-Based and c) Modular algorithms. For each category, we present, analyze and compare state-of-the-art Deep Reinforcement Learning algorithms that achieve high performance in various environments and tackle challenging problems in complex and demanding tasks. In order to give a compact and practical overview of their differences, we present comprehensive comparison figures and tables, produced from the reported performances of the algorithms on two popular simulation platforms: the Atari Learning Environment and the MuJoCo physics simulation platform. We discuss the key differences between the various kinds of algorithms, indicate their potential and limitations, and provide insights to researchers regarding future directions of the field.
Leveraging AI and Intelligent Reflecting Surface for Energy-Efficient Communication in 6G IoT
Pan, Qianqian, Wu, Jun, Zheng, Xi, Li, Jianhua, Li, Shenghong, Vasilakos, Athanasios V.
The ever-increasing data traffic, various delay-sensitive services, and the massive deployment of energy-limited Internet of Things (IoT) devices have brought huge challenges to current communication networks, motivating academia and industry to move to the sixth-generation (6G) network. With its powerful data transmission and processing capabilities, 6G is considered an enabler for IoT communication with low latency and energy cost. In this paper, we propose an artificial intelligence (AI) and intelligent reflecting surface (IRS) empowered energy-efficient communication system for 6G IoT. First, we design a smart and efficient communication architecture including IRS-aided data transmission and AI-driven network resource management mechanisms. Second, an energy-efficiency-maximizing model under a given transmission latency for the 6G IoT system is formulated, which jointly optimizes the settings of all communication participants, i.e., IoT transmission power, IRS-reflection phase shift, and BS detection matrix. Third, a deep reinforcement learning (DRL) empowered network resource control and allocation scheme is proposed to solve the formulated optimization model. Based on the network and channel status, the DRL-enabled scheme facilitates energy-efficient and low-latency communication. Finally, experimental results verify the effectiveness of our proposed communication system for 6G IoT.
LISPR: An Options Framework for Policy Reuse with Reinforcement Learning
Graves, Daniel, Jin, Jun, Luo, Jun
We propose a framework for transferring any existing policy from a potentially unknown source MDP to a target MDP. This framework (1) enables reuse in the target domain of any form of source policy, including classical controllers, heuristic policies, or deep neural network-based policies, (2) attains optimality under suitable theoretical conditions, and (3) guarantees improvement over the source policy in the target MDP. These are achieved by packaging the source policy as a black-box option in the target MDP and providing a theoretically grounded way to learn the option's initiation set through general value functions. Our approach facilitates the learning of new policies by (1) maximizing the target MDP reward with the help of the black-box option, and (2) returning the agent to states in the learned initiation set of the black-box option where it is already optimal. We show that these two variants are equivalent in performance under some conditions. Through a series of experiments in simulated environments, we demonstrate that our framework performs excellently in sparse reward problems given (sub-)optimal source policies and improves upon prior art in transfer methods such as continual learning and progressive networks, which lack our framework's desirable theoretical properties.
A Deep Reinforcement Learning Based Multi-Criteria Decision Support System for Textile Manufacturing Process Optimization
He, Zhenglei, Tran, Kim Phuc, Thomassey, Sebastien, Zeng, Xianyi, Xu, Jie, Haiyi, Chang
Textile manufacturing is a typical traditional industry involving highly complex, interconnected processes and limited capacity for applying modern technologies. Decision-making in this domain generally takes multiple criteria into consideration, which adds further complexity. To address this issue, the present paper proposes a decision support system that combines intelligent data-based random forest (RF) models with a human-knowledge-based analytical hierarchical process (AHP) multi-criteria structure, in accordance with the objective and subjective factors of the textile manufacturing process. More importantly, the textile manufacturing process is formulated as a Markov decision process (MDP), and a deep reinforcement learning scheme, Deep Q-Networks (DQN), is employed to optimize it. The effectiveness of this system has been validated in a case study of optimizing a textile ozonation process, showing that it can better master the challenging decision-making tasks in textile manufacturing processes.
Improved Sample Complexity for Incremental Autonomous Exploration in MDPs
Tarbouriech, Jean, Pirotta, Matteo, Valko, Michal, Lazaric, Alessandro
We investigate the exploration of an unknown environment when no reward function is provided. Building on the incremental exploration setting introduced by Lim and Auer [1], we define the objective of learning the set of $\epsilon$-optimal goal-conditioned policies attaining all states that are incrementally reachable within $L$ steps (in expectation) from a reference state $s_0$. In this paper, we introduce a novel model-based approach that interleaves discovering new states from $s_0$ and improving the accuracy of a model estimate that is used to compute goal-conditioned policies to reach newly discovered states. The resulting algorithm, DisCo, achieves a sample complexity scaling as $\tilde{O}(L^5 S_{L+\epsilon} \Gamma_{L+\epsilon} A \epsilon^{-2})$, where $A$ is the number of actions, $S_{L+\epsilon}$ is the number of states that are incrementally reachable from $s_0$ in $L+\epsilon$ steps, and $\Gamma_{L+\epsilon}$ is the branching factor of the dynamics over such states. This improves over the algorithm proposed in [1] in both $\epsilon$ and $L$ at the cost of an extra $\Gamma_{L+\epsilon}$ factor, which is small in most environments of interest. Furthermore, DisCo is the first algorithm that can return an $\epsilon/c_{\min}$-optimal policy for any cost-sensitive shortest-path problem defined on the $L$-reachable states with minimum cost $c_{\min}$. Finally, we report preliminary empirical results confirming our theoretical findings.
Multi-Principal Assistance Games: Definition and Collegial Mechanisms
Fickinger, Arnaud, Zhuang, Simon, Critch, Andrew, Hadfield-Menell, Dylan, Russell, Stuart
We introduce the concept of a multi-principal assistance game (MPAG), and circumvent an obstacle in social choice theory -- Gibbard's theorem -- by using a sufficiently "collegial" preference inference mechanism. In an MPAG, a single agent assists N human principals who may have widely different preferences. MPAGs generalize assistance games, also known as cooperative inverse reinforcement learning games. We analyze in particular a generalization of apprenticeship learning in which the humans first perform some work to obtain utility and demonstrate their preferences, and then the robot acts to further maximize the sum of human payoffs. We show in this setting that if the game is sufficiently collegial -- i.e., if the humans are responsible for obtaining a sufficient fraction of the rewards through their own actions -- then their preferences are straightforwardly revealed through their work. This revelation mechanism is non-dictatorial, does not limit the possible outcomes to two alternatives, and is dominant-strategy incentive-compatible.
Disentangled Planning and Control in Vision Based Robotics via Reward Machines
Camacho, Alberto, Varley, Jacob, Jain, Deepali, Iscen, Atil, Kalashnikov, Dmitry
In this work we augment a Deep Q-Learning agent with a Reward Machine (DQRM) to speed up the learning of vision-based policies for robot tasks, and to overcome some of the limitations of DQN that prevent it from converging to good-quality policies. A reward machine (RM) is a finite state machine that decomposes a task into a discrete planning graph and equips the agent with a reward function to guide it toward task completion. The reward machine can be used both for reward shaping and for informing the policy which abstract state it is currently in. An abstract state is a high-level simplification of the current state, defined in terms of task-relevant features. These two supervisory signals from the reward machine, reward shaping and knowledge of the current abstract state, complement each other and can both be used to improve policy performance, as demonstrated on several vision-based robotic pick-and-place tasks. Particularly for vision-based robotics applications, it is often easier to build a reward machine than to try to get a policy to learn the task without this structure.
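The reward-machine idea described above can be sketched as a small finite state machine that emits shaped rewards and exposes its abstract state to the policy. This is a minimal illustration rather than the authors' DQRM implementation; the state names, event names, reward values, and the `augment_observation` helper are hypothetical.

```python
class RewardMachine:
    """Finite state machine over abstract task states.

    transitions maps (state, event) -> (next_state, reward). Events that
    have no entry for the current state leave it unchanged with zero reward.
    """

    def __init__(self, transitions, initial, terminal):
        self.transitions = transitions
        self.initial = initial
        self.terminal = set(terminal)
        self.state = initial

    def reset(self):
        self.state = self.initial
        return self.state

    def step(self, event):
        next_state, reward = self.transitions.get(
            (self.state, event), (self.state, 0.0))
        self.state = next_state
        return next_state, reward, next_state in self.terminal


def augment_observation(features, rm_state, all_states):
    """Append a one-hot encoding of the abstract state to the features,
    so the policy knows which stage of the task it is in."""
    one_hot = [1.0 if s == rm_state else 0.0 for s in all_states]
    return list(features) + one_hot


# Hypothetical pick-and-place task with three abstract states.
STATES = ["searching", "holding", "done"]
rm = RewardMachine(
    transitions={
        ("searching", "grasped"): ("holding", 0.5),  # shaping reward
        ("holding", "dropped"): ("searching", -0.5),
        ("holding", "placed"): ("done", 1.0),        # task completion
    },
    initial="searching",
    terminal=["done"],
)

state = rm.reset()
for event in ["noop", "grasped", "placed"]:
    state, reward, finished = rm.step(event)
    print(state, reward, finished)
# prints:
# searching 0.0 False
# holding 0.5 False
# done 1.0 True
```

In a full agent, the events would come from a detector over the environment state, and the augmented observation would be the input to the Q-network.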
Causal World Models by Unsupervised Deconfounding of Physical Dynamics
Li, Minne, Yang, Mengyue, Liu, Furui, Chen, Xu, Chen, Zhitang, Wang, Jun
The capability of imagining internally with a mental model of the world is vitally important for human cognition. If a machine intelligent agent can learn a world model to create a "dream" environment, it can then internally ask what-if questions -- simulate the alternative futures that haven't been experienced in the past yet -- and make optimal decisions accordingly. Existing world models are established typically by learning spatio-temporal regularities embedded from the past sensory signal without taking into account confounding factors that influence state transition dynamics. As such, they fail to answer the critical counterfactual questions about "what would have happened" if a certain action policy was taken. In this paper, we propose Causal World Models (CWMs) that allow unsupervised modeling of relationships between the intervened observations and the alternative futures by learning an estimator of the latent confounding factors. We empirically evaluate our method and demonstrate its effectiveness in a variety of physical reasoning environments. Specifically, we show reductions in sample complexity for reinforcement learning tasks and improvements in counterfactual physical reasoning.
Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds Globally Optimal Policy
Zhong, Han, Fang, Ethan X., Yang, Zhuoran, Wang, Zhaoran
While deep reinforcement learning has achieved tremendous successes in various applications, most existing works only focus on maximizing the expected value of total return and thus ignore its inherent stochasticity. Such stochasticity is also known as the aleatoric uncertainty and is closely related to the notion of risk. In this work, we make the first attempt to study risk-sensitive deep reinforcement learning under the average reward setting with the variance risk criterion. In particular, we focus on a variance-constrained policy optimization problem where the goal is to find a policy that maximizes the expected value of the long-run average reward, subject to a constraint that the long-run variance of the average reward is upper bounded by a threshold. Utilizing Lagrangian and Fenchel dualities, we transform the original problem into an unconstrained saddle-point policy optimization problem, and propose an actor-critic algorithm that iteratively and efficiently updates the policy, the Lagrange multiplier, and the Fenchel dual variable. When both the value and policy functions are represented by multi-layer overparameterized neural networks, we prove that our actor-critic algorithm generates a sequence of policies that finds a globally optimal policy at a sublinear rate.
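The transformation into a saddle-point problem can be sketched as follows. The notation is assumed for illustration and is not taken from the abstract: $\rho(\pi)$ denotes the long-run average reward, $\eta(\pi)$ the long-run average of the squared reward, $\alpha$ the variance threshold, $\lambda$ the Lagrange multiplier, and $y$ the Fenchel dual variable. The paper's exact formulation may differ.

```latex
% Constrained problem (assumed notation), with the long-run variance
% written as \eta(\pi) - \rho(\pi)^2:
\max_{\pi} \; \rho(\pi)
\quad \text{s.t.} \quad \eta(\pi) - \rho(\pi)^2 \le \alpha .

% Lagrangian relaxation with multiplier \lambda \ge 0:
\max_{\pi} \, \min_{\lambda \ge 0} \;
  \rho(\pi) - \lambda \bigl( \eta(\pi) - \rho(\pi)^2 - \alpha \bigr) .

% Fenchel duality for the square function:
\rho(\pi)^2 = \max_{y \in \mathbb{R}} \bigl( 2 y \, \rho(\pi) - y^2 \bigr) .

% Substituting (valid because \lambda \ge 0) gives a saddle-point problem
% over the policy, the multiplier, and the dual variable:
\max_{\pi} \, \min_{\lambda \ge 0} \, \max_{y \in \mathbb{R}} \;
  \rho(\pi) - \lambda \bigl( \eta(\pi) - 2 y \, \rho(\pi) + y^2 - \alpha \bigr) .
```

The substitution removes the squared term $\rho(\pi)^2$, so the objective becomes linear in the averages $\rho(\pi)$ and $\eta(\pi)$ for fixed $\lambda$ and $y$, which is what makes iterative updates of the policy, $\lambda$, and $y$ tractable.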