Reinforcement Learning
A Benchmarking Environment for Reinforcement Learning Based Task Oriented Dialogue Management
Casanueva, Iñigo, Budzianowski, Paweł, Su, Pei-Hao, Mrkšić, Nikola, Wen, Tsung-Hsien, Ultes, Stefan, Rojas-Barahona, Lina, Young, Steve, Gašić, Milica
Dialogue assistants are rapidly becoming an indispensable daily aid. To avoid the significant effort needed to hand-craft the required dialogue flow, the Dialogue Management (DM) module can be cast as a continuous Markov Decision Process (MDP) and trained through Reinforcement Learning (RL). Several RL models have been investigated over recent years. However, the lack of a common benchmarking framework makes it difficult to perform a fair comparison between different models and their capability to generalise to different environments. Therefore, this paper proposes a set of challenging simulated environments for dialogue model development and evaluation. To provide some baselines, we investigate a number of representative parametric algorithms, namely the deep reinforcement learning algorithms DQN, A2C and Natural Actor-Critic, and compare them to a non-parametric model, GP-SARSA. Both the environments and policy models are implemented using the publicly available PyDial toolkit and released on-line, in order to establish a testbed framework for further experiments and to facilitate experimental reproducibility.
Deep Reinforcement Learning for De-Novo Drug Design
Popova, Mariya, Isayev, Olexandr, Tropsha, Alexander
We propose a novel computational strategy based on deep and reinforcement learning techniques for de-novo design of molecules with desired properties. This strategy integrates two deep neural networks - generative and predictive - that are trained separately but employed jointly to generate novel chemical structures with the desired properties. Generative models are trained to produce chemically feasible SMILES, and predictive models are derived to forecast the desired compound properties. One example of such an approach is the broad use of Lipinski's rules of bioavailability (15, 16) to filter molecules that possess the desired bioactivity in vitro. Indeed, it has been acknowledged that the broad use of these rules has substantially reduced the failure rate in experimental ADME studies of drug candidates (17). The crucial step in many new drug discovery projects is the formulation of a well-motivated hypothesis for new lead compound generation (de novo design) or compound selection from available or synthetically feasible chemical libraries based on the available SAR data. Commonly, an interdisciplinary team of scientists generates the new hypothesis by employing computational models of drug action and relying on their expertise and medicinal chemistry intuition. Therefore, the design hypothesis is often biased towards preferred chemistry (18) or driven by model interpretation (19). Automated approaches for designing compounds with desired properties de novo have become an active field of research in the last 15 years (20, 21). In an attempt to design new compounds, both medicinal and computational chemists face virtually infinite chemical space. Great advances in both computational algorithms (24, 25), hardware, and high-throughput screening (HTS) technologies (16) notwithstanding, the size of this virtual library prohibits its exhaustive sampling and testing by systematic construction and evaluation of each individual compound.
Local optimization approaches have been proposed but they do not ensure the optimal solution, as the design process converges on a local or 'practical' optimum by stochastic sampling, or restrict the search to a defined section of chemical space which can be screened exhaustively (20, 26-28).
Efficient exploration with Double Uncertain Value Networks
Moerland, Thomas M., Broekens, Joost, Jonker, Catholijn M.
This paper studies directed exploration for reinforcement learning agents by tracking uncertainty about the value of each available action. We identify two sources of uncertainty that are relevant for exploration. The first originates from limited data (parametric uncertainty), while the second originates from the distribution of the returns (return uncertainty). We identify methods to learn these distributions with deep neural networks, where we estimate parametric uncertainty with Bayesian drop-out, while return uncertainty is propagated through the Bellman equation as a Gaussian distribution. Then, we identify that both can be jointly estimated in one network, which we call the Double Uncertain Value Network. The policy is directly derived from the learned distributions based on Thompson sampling. Experimental results show that both types of uncertainty may vastly improve learning in domains with a strong exploration challenge.
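As a rough illustration of the Thompson sampling step described above, the sketch below assumes each action already has a Gaussian value estimate (a mean and a standard deviation). It does not reproduce the paper's network: the function name thompson_select and all numbers are made up for illustration, and the two uncertainty sources are assumed to have been merged into one standard deviation per action.

```python
import random


def thompson_select(means, stds, rng):
    """Sample one value per action from N(mean, std) and act greedily on the samples."""
    samples = [rng.gauss(m, s) for m, s in zip(means, stds)]
    return max(range(len(samples)), key=lambda a: samples[a])


# Hypothetical value estimates for three actions (illustrative numbers only).
means = [1.0, 0.9, 0.2]   # estimated mean return per action
stds = [0.05, 0.5, 0.05]  # assumed combined uncertainty per action

rng = random.Random(0)
counts = [0, 0, 0]
for _ in range(10_000):
    counts[thompson_select(means, stds, rng)] += 1
```

Action 1 has a slightly lower mean than action 0 but a much wider distribution, so it is still chosen in a large share of the draws, while the confidently poor action 2 is almost never tried. That is the directed exploration effect: uncertain actions get sampled until their estimates tighten.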
Diff-DAC: Distributed Actor-Critic for Multitask Deep Reinforcement Learning
Macua, Sergio Valcarcel, Tukiainen, Aleksi, Hernández, Daniel García-Ocaña, Baldazo, David, de Cote, Enrique Munoz, Zazo, Santiago
We propose a multiagent distributed actor-critic algorithm for multitask reinforcement learning (MRL), named Diff-DAC. The agents are connected, forming a (possibly sparse) network. Each agent is assigned a task and has access to data from this local task only. During the learning process, the agents are able to communicate some parameters to their neighbors. Since the agents incorporate their neighbors' parameters into their own learning rules, the information is diffused across the network, and they can learn a common policy that generalizes well across all tasks. Diff-DAC is scalable since the computational complexity and communication overhead per agent grow with the number of neighbors, rather than with the total number of agents. Moreover, the algorithm is fully distributed in the sense that agents self-organize, with no need for a coordinator node. Diff-DAC follows an actor-critic scheme where the value function and the policy are approximated with deep neural networks, which allows it to learn expressive policies from raw data. As a by-product of Diff-DAC's derivation from duality theory, we provide novel insights into the standard actor-critic framework, showing that it is actually an instance of the dual ascent method to approximate the solution of a linear program. Experiments illustrate the performance of the algorithm in the cart-pole, inverted pendulum, and swing-up cart-pole environments.
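The way information diffuses through neighbor-only communication can be sketched on a toy problem. The code below is not the Diff-DAC actor-critic update. It assumes a generic adapt-then-combine diffusion step, in which each agent takes a local gradient step and then averages with its neighbors, applied to scalar quadratic losses. The ring topology, combination weights, targets and step size are all illustrative assumptions.

```python
def diffusion_step(thetas, targets, weights, step_size):
    """One adapt-then-combine round over all agents."""
    # Adapt: each agent takes a gradient step on its own loss 0.5 * (theta - target)^2.
    adapted = [th - step_size * (th - tg) for th, tg in zip(thetas, targets)]
    # Combine: each agent averages the adapted values of itself and its neighbors.
    return [sum(w * a for w, a in zip(row, adapted)) for row in weights]


# Four agents on a ring; each agent only weights itself and its two neighbors.
targets = [0.0, 1.0, 2.0, 3.0]  # hypothetical per-task optima
weights = [
    [0.50, 0.25, 0.00, 0.25],
    [0.25, 0.50, 0.25, 0.00],
    [0.00, 0.25, 0.50, 0.25],
    [0.25, 0.00, 0.25, 0.50],
]

thetas = [0.0, 0.0, 0.0, 0.0]
for _ in range(500):
    thetas = diffusion_step(thetas, targets, weights, step_size=0.05)
```

No agent ever sees more than its two neighbors, yet all four parameters settle close to 1.5, the minimizer of the summed losses. With a constant step size a small per-agent offset remains, and it shrinks as the step size decreases.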
Plan, Attend, Generate: Planning for Sequence-to-Sequence Models
Dutil, Francis, Gulcehre, Caglar, Trischler, Adam, Bengio, Yoshua
We investigate the integration of a planning mechanism into sequence-to-sequence models using attention. We develop a model which can plan ahead in the future when it computes its alignments between input and output sequences, constructing a matrix of proposed future alignments and a commitment vector that governs whether to follow or recompute the plan. This mechanism is inspired by the recently proposed strategic attentive reader and writer (STRAW) model for Reinforcement Learning. Our proposed model is end-to-end trainable using primarily differentiable operations. We show that it outperforms a strong baseline on character-level translation tasks from WMT'15, the algorithmic task of finding Eulerian circuits of graphs, and question generation from the text. Our analysis demonstrates that the model computes qualitatively intuitive alignments, converges faster than the baselines, and achieves superior performance with fewer parameters.
Stochastic approximation for speeding up LSTD (and LSPI)
Prashanth, L. A., Korda, Nathaniel, Munos, Rémi
We propose a stochastic approximation (SA) based method with randomization of samples for policy evaluation using the least squares temporal difference (LSTD) algorithm. Our method results in an $O(d)$ improvement in complexity in comparison to regular LSTD, where $d$ is the dimension of the data. We provide convergence rate results for our proposed method, both in high probability and in expectation. Moreover, we also establish that using our scheme in place of LSTD does not impact the rate of convergence of the approximate value function to the true value function and hence a low-complexity LSPI variant that uses our SA based scheme has the same order of the performance bounds as that of regular LSPI. These rate results coupled with the low complexity of our method make it attractive for implementation in big data settings, where $d$ is large. Furthermore, we analyze a similar low-complexity alternative for least squares regression and provide finite-time bounds there. We demonstrate the practicality of our method for LSTD empirically by combining it with the LSPI algorithm in a traffic signal control application. We also conduct another set of experiments that combines the SA based low-complexity variant for least squares regression with the LinUCB algorithm for contextual bandits, using the large scale news recommendation dataset from Yahoo.
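To make the complexity claim concrete, here is a minimal sketch of a stochastic-approximation alternative to solving the LSTD linear system. It assumes a TD(0)-style update on transitions drawn uniformly at random from a fixed batch, so each iteration costs O(d) instead of the O(d^2) of a matrix update. The function name sa_lstd, the constant step size and the two-state example are illustrative choices, not taken from the paper.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sa_lstd(batch, dim, discount, n_iters, rng, step=lambda n: 0.1):
    """Iterate theta += step * td_error * phi on uniformly resampled transitions."""
    theta = [0.0] * dim
    for n in range(n_iters):
        phi, reward, phi_next = rng.choice(batch)  # random sample from the batch
        td_error = reward + discount * dot(theta, phi_next) - dot(theta, phi)
        g = step(n)
        theta = [t + g * td_error * p for t, p in zip(theta, phi)]  # O(d) work
    return theta


# Toy batch with one-hot features: state 0 -> state 1 (reward 0),
# state 1 -> state 1 (reward 1). With discount 0.5 the true values are 1 and 2.
batch = [
    ([1.0, 0.0], 0.0, [0.0, 1.0]),
    ([0.0, 1.0], 1.0, [0.0, 1.0]),
]
theta = sa_lstd(batch, dim=2, discount=0.5, n_iters=20_000, rng=random.Random(0))
```

On this deterministic toy batch the iterate approaches the same solution that LSTD would compute, theta close to [1, 2], without ever forming or inverting a d-by-d matrix.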
From classic AI techniques to Deep Reinforcement Learning
Building machines that can learn from examples, from experience, or even from other machines at a human level is the main goal of solving AI. In other words, the goal is to create a machine that passes the Turing test: a human interacting with it cannot tell whether they are interacting with a human or a machine [Turing, A.M 1950]. The fundamental algorithms of deep learning were developed in the middle of the 20th century. Since then the field has developed as a theoretical branch of stochastic operations research and computer science, but without any breakthrough application. In the last 20 years, however, the synergy between big data sets, especially labeled data, and increased computing power from graphics processing units has allowed those algorithms to be developed into more complex techniques, technologies and reasoning logics that achieve several goals, such as reducing word error rates in speech recognition, cutting the error rate in an image recognition competition [Krizhevsky et al 2012], and beating a human champion at Go [Silver et al 2016].
Speeding up DQN on PyTorch: how to solve Pong in 30 minutes
Some time ago I implemented all the models from the article Rainbow: Combining Improvements in Deep Reinforcement Learning using PyTorch and my small RL library called PTAN. The code of the eight systems is here if you're curious. To debug and test it I used the Pong game from the Atari suite, mostly due to its simplicity, fast convergence, and robustness to hyperparameters: you can use a replay buffer 10 to 100 times smaller and it will still converge nicely. This is extremely helpful for a Deep RL enthusiast without access to the computational resources Google employees have. While implementing and debugging the code, I needed to run about 100–200 optimisations, so it does matter how long one run takes: 2–3 days or just an hour. Nevertheless, you should always keep a balance here: in trying to squeeze out as much performance as possible, you can introduce bugs, which will dramatically complicate an already complex debugging and implementation process.
Malaria Likelihood Prediction By Effectively Surveying Households Using Deep Reinforcement Learning
Rajpurkar, Pranav, Polamreddi, Vinaya, Balakrishnan, Anusha
We build a deep reinforcement learning (RL) agent that can predict the likelihood of an individual testing positive for malaria by asking questions about their household. The RL agent learns to determine which survey question to ask next and when to stop to make a prediction about their likelihood of malaria based on their responses hitherto. The agent incurs a small penalty for each question asked, and a large reward/penalty for making the correct/wrong prediction; it thus has to learn to balance the length of the survey with the accuracy of its final predictions. Our RL agent is a Deep Q-network that learns a policy directly from the responses to the questions, with an action defined for each possible survey question and for each possible prediction class. We focus on Kenya, where malaria is a massive health burden, and train the RL agent on a dataset of 6481 households from the Kenya Malaria Indicator Survey 2015. To investigate the importance of having survey questions be adaptive to responses, we compare our RL agent to a supervised learning (SL) baseline that fixes its set of survey questions a priori. We evaluate on prediction accuracy and on the number of survey questions asked on a holdout set and find that the RL agent is able to predict with 80% accuracy, using only 2.5 questions on average. In addition, the RL agent learns to survey adaptively to responses and is able to match the SL baseline in prediction accuracy while significantly reducing survey length.
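The trade-off between survey length and accuracy can be sketched as an episode return. The abstract gives no reward magnitudes, so the values below (question_cost, correct_reward, wrong_penalty) and the action encoding are hypothetical. Only the structure follows the description above: a small penalty per question and a large reward or penalty for the final prediction.

```python
def episode_return(actions, true_label,
                   question_cost=0.05, correct_reward=1.0, wrong_penalty=1.0):
    """Sum the rewards of one survey episode (all magnitudes are assumptions)."""
    total = 0.0
    for kind, value in actions:
        if kind == "ask":
            total -= question_cost  # small penalty for every question asked
        elif kind == "predict":
            # Large reward or penalty; a prediction ends the episode.
            total += correct_reward if value == true_label else -wrong_penalty
            break
        else:
            raise ValueError(f"unknown action kind: {kind!r}")
    return total


# Hypothetical episodes: a short survey and a longer one, both predicting class 1.
short_survey = [("ask", "q_bednet"), ("ask", "q_fever"), ("predict", 1)]
long_survey = [("ask", f"q{i}") for i in range(10)] + [("predict", 1)]

short_correct = episode_return(short_survey, true_label=1)
long_correct = episode_return(long_survey, true_label=1)
short_wrong = episode_return(short_survey, true_label=0)
```

Under these assumed values a correct prediction after two questions returns about 0.9 and one after ten questions about 0.5. A wrong prediction costs far more than any extra question, which is what pushes the agent to keep asking only while the answers are still informative.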
Deep Reinforcement Learning that Matters
Henderson, Peter, Islam, Riashat, Bachman, Philip, Pineau, Joelle, Precup, Doina, Meger, David
In recent years, significant progress has been made in solving challenging problems across various domains using deep reinforcement learning (RL). Reproducing existing work and accurately judging the improvements offered by novel methods is vital to sustaining this progress. Unfortunately, reproducing results for state-of-the-art deep RL methods is seldom straightforward. In particular, non-determinism in standard benchmark environments, combined with variance intrinsic to the methods, can make reported results tough to interpret. Without significance metrics and tighter standardization of experimental reporting, it is difficult to determine whether improvements over the prior state-of-the-art are meaningful. In this paper, we investigate challenges posed by reproducibility, proper experimental techniques, and reporting procedures. We illustrate the variability in reported metrics and results when comparing against common baselines and suggest guidelines to make future results in deep RL more reproducible. We aim to spur discussion about how to ensure continued progress in the field by minimizing wasted effort stemming from results that are non-reproducible and easily misinterpreted.
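One simple form of significance reporting is a bootstrap confidence interval on the difference in mean return between two methods, computed over independent random seeds. The sketch below is a generic percentile bootstrap, not a procedure prescribed by the paper. The function name bootstrap_mean_diff_ci is hypothetical and the return values are invented for illustration.

```python
import random


def bootstrap_mean_diff_ci(a, b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean(b) - mean(a)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        # Resample each set of runs with replacement, keeping its size.
        ra = [rng.choice(a) for _ in a]
        rb = [rng.choice(b) for _ in b]
        diffs.append(sum(rb) / len(rb) - sum(ra) / len(ra))
    diffs.sort()
    low = diffs[int((alpha / 2) * n_resamples)]
    high = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high


# Made-up final returns from five seeds per method (not real results).
baseline_returns = [10.0, 12.0, 11.0, 13.0, 12.0]
new_method_returns = [20.0, 22.0, 21.0, 19.0, 23.0]

low, high = bootstrap_mean_diff_ci(baseline_returns, new_method_returns)
```

If the interval excludes zero, as it does for these invented numbers, the difference is unlikely to be an artifact of seed variance. With only a handful of seeds the interval is itself rough, so reporting the number of seeds alongside it matters.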