Reinforcement Learning
An Empirical Comparison on Imitation Learning and Reinforcement Learning for Paraphrase Generation
Generating paraphrases from given sentences involves decoding words step by step from a large vocabulary. To learn a decoder, supervised learning that maximizes the likelihood of tokens suffers from exposure bias. Although both reinforcement learning (RL) and imitation learning (IL) have been widely used to alleviate this bias, the lack of a direct comparison gives only a partial picture of their benefits. In this work, we present an empirical study of how RL and IL can help boost the performance of paraphrase generation, with the pointer-generator as a base model. Experiments on the benchmark datasets show that (1) imitation learning is consistently better than reinforcement learning; and (2) the pointer-generator models with imitation learning outperform the state-of-the-art methods by a large margin.
STMARL: A Spatio-Temporal Multi-Agent Reinforcement Learning Approach for Traffic Light Control
Wang, Yanan, Xu, Tong, Niu, Xin, Tan, Chang, Chen, Enhong, Xiong, Hui
The development of intelligent traffic light control systems is essential for smart transportation management. While some efforts have been made to optimize individual traffic lights in isolation, related studies have largely ignored two facts: traffic lights at multiple intersections influence one another spatially, and current traffic light control depends temporally on historical traffic status. To that end, in this paper, we propose a novel Spatio-Temporal Multi-Agent Reinforcement Learning (STMARL) framework for effectively capturing the spatio-temporal dependency of multiple related traffic lights and controlling these traffic lights in a coordinated way. Specifically, we first construct the traffic light adjacency graph based on the spatial structure among traffic lights. Then, historical traffic records are integrated with current traffic status via a recurrent neural network structure. Moreover, based on the temporally dependent traffic information, we design a graph neural network based model to represent relationships among multiple traffic lights, and the decision for each traffic light is made in a distributed way by the deep Q-learning method. Finally, experimental results on both synthetic and real-world data demonstrate the effectiveness of our STMARL framework, which also provides an insightful understanding of the influence mechanism among multi-intersection traffic lights.
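The first and third steps of the abstract above (an adjacency graph, then a graph-based model over it) can be illustrated with a minimal sketch. It assumes intersections lie on a rectangular grid and uses a single round of mean aggregation over queue lengths as a simplified stand-in for the graph neural network; the function names, the grid layout, and the queue-length feature are hypothetical and are not taken from the STMARL paper.

```python
def grid_adjacency(rows, cols):
    """Adjacency lists for intersections on a rows x cols grid (4-neighbourhood).

    The grid layout is an assumption of this sketch, not the paper's road network.
    """
    adj = {}
    for r in range(rows):
        for c in range(cols):
            neighbours = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    neighbours.append((nr, nc))
            adj[(r, c)] = neighbours
    return adj


def aggregate(adj, features):
    """One message-passing round: average each node's feature with its neighbours'."""
    out = {}
    for node, neighbours in adj.items():
        group = [features[node]] + [features[n] for n in neighbours]
        out[node] = sum(group) / len(group)
    return out


# Hypothetical queue lengths (vehicles waiting) at four intersections.
adj = grid_adjacency(2, 2)
queue_lengths = {(0, 0): 8.0, (0, 1): 2.0, (1, 0): 4.0, (1, 1): 0.0}
smoothed = aggregate(adj, queue_lengths)
```

After aggregation, each intersection's feature reflects its neighbours' congestion as well as its own, which is the kind of spatial context a per-intersection Q-network could then consume.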
DeepMind details OpenSpiel, a collection of AI training tools for video games
Reinforcement learning, the AI training technique that's brought to fruition systems capable of defeating world poker champions and guiding self-driving cars, isn't the simplest thing in the world to wrangle. That's particularly true in the gaming domain, where cutting-edge approaches sometimes require bespoke tools that aren't publicly available. In a paper recently published on the preprint server Arxiv.org, DeepMind researchers detail OpenSpiel. At its core, it's a collection of environments and algorithms for research in general reinforcement learning and search and planning in games, with tools to analyze learning dynamics and other common evaluation metrics. "The purpose of OpenSpiel is to promote general multiagent reinforcement learning across many different game types, in a similar way as general game-playing but with a heavy emphasis on learning and not in competition form," wrote the researchers.
Intelligent Active Queue Management Using Explicit Congestion Notification
Gomez, Cesar A., Wang, Xianbin, Shami, Abdallah
As more end devices get connected, the Internet will become more congested. Various congestion control techniques have been developed on either the transport or the network layer. Active Queue Management (AQM) is a paradigm that aims to mitigate congestion on the network layer through active buffer control to avoid overflow. However, finding the right parameters for an AQM scheme is challenging, due to the complexity and dynamics of the networks. On the other hand, the Explicit Congestion Notification (ECN) mechanism makes incipient congestion on the network layer visible to the transport layer. In this work, we propose to exploit the ECN information to improve AQM algorithms by applying machine learning techniques. Our intelligent method uses an artificial neural network to predict congestion and an AQM parameter tuner based on reinforcement learning. The evaluation results show that our solution can enhance the performance of deployed AQM, using the existing TCP congestion control mechanisms. Thanks to the proliferation of smart devices and the paradigm of the Internet of Things (IoT), the demand for connections to the Internet is growing dramatically.
HyMER: A Hybrid Machine Learning Framework for Energy Efficient Routing in SDN
Assefa, Beakal Gizachew, Ozkasap, Oznur
Combining the network programmability of SDN with the pattern discovery of machine learning has been applied to security, traffic classification, QoS prediction, and network performance, and has attracted the attention of researchers. In this work, we propose HyMER: a novel hybrid machine learning framework for traffic-aware, energy-efficient routing in SDN, which has supervised and reinforcement learning components. The supervised learning component consists of feature extraction, training, and testing. The reinforcement learning component learns from existing data or from scratch by iteratively interacting with the network environment. The framework is developed on the POX controller and is evaluated on Mininet using the Abilene, GEANT, and Nobel-Germany real-world topologies and dynamic traffic traces. Experimental results show that the supervised component achieves up to 70% feature size reduction and more than 80% accuracy in parameter prediction. The refine heuristics algorithm increases the accuracy of the prediction to 100% with a 14X to 25X speedup compared to the brute force method. The reinforcement learning module converges within 100 to 275 iterations and converges twice as fast if applied on top of the supervised component. Moreover, HyMER achieves up to 10 watts per switch power saving, 30% link saving, and a 2-hop decrease in average path length.
Exploration-Enhanced POLITEX
Abbasi-Yadkori, Yasin, Lazic, Nevena, Szepesvari, Csaba, Weisz, Gellert
We study algorithms for average-cost reinforcement learning problems with value function approximation. Our starting point is the recently proposed POLITEX algorithm, a version of policy iteration where the policy produced in each iteration is near-optimal in hindsight for the sum of all past value function estimates. POLITEX has sublinear regret guarantees in uniformly-mixing MDPs when the value estimation error can be controlled, which can be satisfied if all policies sufficiently explore the environment. Unfortunately, this assumption is often unrealistic. Motivated by the rapid growth of interest in developing policies that learn to explore their environment in the absence of rewards (also known as no-reward learning), we replace the previous assumption that all policies explore the environment with the assumption that a single, sufficiently exploring policy is available beforehand. The main contribution of the paper is the modification of POLITEX to incorporate such an exploration policy in a way that allows us to obtain a regret guarantee similar to the previous one but without requiring that all policies explore the environment. In addition to the novel theoretical guarantees, we demonstrate the benefits of our scheme on environments which are difficult to explore using simple schemes like dithering. While the solution we obtain may not achieve the best possible regret, it is the first result that shows how to control the regret in the presence of function approximation errors on problems where exploration is nontrivial. Our approach can also be seen as a way of reducing the problem of minimizing the regret to learning a good exploration policy. We believe that modular approaches like ours can be highly beneficial in tackling harder control problems.
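The phrase "near-optimal in hindsight for the sum of all past value function estimates" can be made concrete with a small sketch. It assumes an exponential-weights (softmax) policy over summed action-cost estimates for a single state, which is one standard way to obtain such a policy; the function name `politex_policy`, the step size `eta`, and the sample numbers are illustrative assumptions, and the exploration-policy modification described in the abstract is not modelled.

```python
import math


def politex_policy(q_estimates, eta):
    """Return pi(a|s) proportional to exp(-eta * sum_i Q_i(s, a)) for one state.

    q_estimates: list of per-iteration cost estimates, each a list indexed by action.
    Costs are assumed (average-cost setting); for rewards the sign would flip.
    """
    num_actions = len(q_estimates[0])
    summed = [sum(q[a] for q in q_estimates) for a in range(num_actions)]
    # Subtract the minimum summed cost for numerical stability.
    base = min(summed)
    weights = [math.exp(-eta * (c - base)) for c in summed]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical cost estimates from three past iterations, two actions.
past_q = [[1.0, 2.0], [1.5, 1.0], [0.5, 2.5]]
probs = politex_policy(past_q, eta=1.0)
```

Because the policy depends on the running sum rather than only the latest estimate, a single noisy estimate (the second iteration here favours action 1) shifts the action probabilities only gradually.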
Ensemble-Based Deep Reinforcement Learning for Chatbots
Cuayáhuitl, Heriberto, Lee, Donghyeon, Ryu, Seonghan, Cho, Yongjin, Choi, Sungja, Indurthi, Satish, Yu, Seunghak, Choi, Hyungtak, Hwang, Inchul, Kim, Jihie
Such an agent is typically characterised by: (i) a finite set of states $S = \{s_i\}$ that describe all possible situations in the environment; (ii) a finite set of actions $A = \{a_j\}$ to change the environment from one situation to another; (iii) a state transition function $T(s,a,s')$ that specifies the next state $s'$ for having taken action $a$ in the current state $s$; (iv) a reward function $R(s,a,s')$ that specifies a numerical value given to the agent for taking action $a$ in state $s$ and transitioning to state $s'$; and (v) a policy $\pi: S \rightarrow A$ that defines a mapping from states to actions [2, 30]. The goal of a reinforcement learning agent is to find an optimal policy by maximising its cumulative discounted reward defined as $Q^*(s,a) = \max_\pi \mathbb{E}[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots \mid s_t = s, a_t = a, \pi]$, where function $Q^*$ represents the maximum sum of rewards $r_t$ discounted by factor $\gamma$ at each time step. While a reinforcement learning agent takes actions with probability $\Pr(a \mid s)$ during training, it selects the best action at test time according to $\pi^*(s) = \arg\max_{a \in A} Q^*(s,a)$. A deep reinforcement learning agent approximates $Q^*$ using a multi-layer neural network [31]. The Q function is parameterised as $Q(s,a;\theta)$, where $\theta$ are the parameters or weights of the neural network (recurrent neural network in our case). Estimating these weights requires a dataset of learning experiences $D = \{e_1, \dots, e_N\}$ (also referred to as 'experience replay memory'), where every experience is described as a tuple $e_t = (s_t, a_t, r_t, s_{t+1})$. Inducing a Q function consists in applying Q-learning updates over minibatches of experience $MB = \{(s,a,r,s') \sim U(D)\}$ drawn uniformly at random from the full dataset $D$. This process is implemented in learning algorithms using Deep Q-Networks (DQN) such as those described in [31, 32, 33], and the following section describes a DQN-based algorithm for human-chatbot interaction.
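The update loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' DQN: it replaces the neural network $Q(s,a;\theta)$ with a lookup table, and the two-state environment, the use of `None` to mark a terminal transition, and all hyperparameters are assumptions made only for this example.

```python
import random
from collections import defaultdict


def greedy_action(q, state, actions):
    """Test-time policy: pick argmax_a Q(s, a)."""
    return max(actions, key=lambda a: q[(state, a)])


def q_learning_from_replay(replay, actions, gamma=0.9, lr=0.5,
                           batch_size=4, num_updates=500, seed=0):
    """Apply Q-learning updates over minibatches sampled uniformly from D."""
    rng = random.Random(seed)
    q = defaultdict(float)  # tabular stand-in for Q(s, a; theta)
    for _ in range(num_updates):
        # Minibatch MB drawn uniformly at random (with replacement) from D.
        minibatch = [rng.choice(replay) for _ in range(batch_size)]
        for s, a, r, s_next in minibatch:
            # s_next is None marks a terminal transition (assumption of this sketch).
            bootstrap = 0.0 if s_next is None else max(q[(s_next, b)] for b in actions)
            target = r + gamma * bootstrap
            q[(s, a)] += lr * (target - q[(s, a)])
    return q


# Hypothetical replay memory D of experiences (s, a, r, s').
replay = [
    ("s0", "go", 0.0, "s1"),
    ("s0", "stay", 0.0, "s0"),
    ("s1", "go", 1.0, None),
    ("s1", "stay", 0.0, "s1"),
]
actions = ["go", "stay"]
q = q_learning_from_replay(replay, actions)
```

In a DQN the table update would be replaced by a gradient step on the squared difference between `target` and the network output, typically with a separate target network; those details are omitted here.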
Deep Reinforcement Learning for Chatbots Using Clustered Actions and Human-Likeness Rewards
Cuayáhuitl, Heriberto, Lee, Donghyeon, Ryu, Seonghan, Choi, Sungja, Hwang, Inchul, Kim, Jihie
Training chatbots using the reinforcement learning paradigm is challenging due to high-dimensional states, infinite action spaces and the difficulty in specifying the reward function. We address such problems using clustered actions instead of infinite actions, and a simple but promising reward function based on human-likeness scores derived from human-human dialogue data. We train Deep Reinforcement Learning (DRL) agents using chitchat data in raw text, without any manual annotations. Experimental results using different splits of training data report the following. First, our agents learn reasonable policies in the environments they get familiarised with, but their performance drops substantially when they are exposed to a test set of unseen dialogues. Second, the choice of sentence embedding size between 100 and 300 dimensions makes no significant difference on test data. Third, our proposed human-likeness rewards are reasonable for training chatbots as long as they use lengthy dialogue histories of >=10 sentences.
Continuous Value Iteration (CVI) Reinforcement Learning and Imaginary Experience Replay (IER) for learning multi-goal, continuous action and state space controllers
Gerken, Andreas, Spranger, Michael
Sony Computer Science Laboratories Inc., Tokyo, Japan. Abstract: This paper presents a novel model-free Reinforcement Learning algorithm for learning behavior in continuous action, state, and goal spaces. The algorithm approximates optimal value functions using nonparametric estimators. It is able to efficiently learn to reach multiple arbitrary goals in deterministic and nondeterministic environments. To improve generalization in the goal space, we propose a novel sample augmentation technique. Using these methods, robots learn faster and overall better controllers. We benchmark the proposed algorithms using simulation and a real-world voltage controlled robot that learns to maneuver in a non-observable Cartesian task space. Introduction: Learning to control one's body is a crucial skill for any embodied agent. A common way of framing the problem of learning to control an agent is Reinforcement Learning (RL). RL poses the problem in terms of actions that an agent can perform, observed states of the world and some reward function that pays out a treat or punishes the agent depending on its performance. The aim of an optimal RL controller is to maximize the collected rewards. Reinforcement Learning has been studied widely and applied to different domains of learning and control.
Research on Autonomous Maneuvering Decision of UCAV based on Approximate Dynamic Programming
Hu, Zhencai, Gao, Peng, Wang, Fei
Unmanned aircraft systems can perform more dangerous and difficult missions than manned aircraft systems. In some highly complicated and changeable tasks, such as air combat, the maneuvering decision mechanism is required to sense the combat situation accurately and select the optimal strategy in real time. This paper presents a formulation of a 3-D one-on-one air combat maneuvering problem and an approximate dynamic programming approach for computing an optimal policy for autonomous maneuvering decision making. The aircraft learns combat strategies using a reinforcement learning method, sensing the environment, taking available maneuvering actions and receiving feedback reward signals. To address the dimensional explosion in air combat, the proposed method is implemented through feature selection, trajectory sampling, function approximation and Bellman backup operations in the air combat simulation environment. This approximate dynamic programming approach provides a fast response to a rapidly changing tactical situation and learns long-term planning without any explicitly coded air combat rule base.