Reinforcement Learning
Discrete and Continuous Action Representation for Practical RL in Video Games
Delalleau, Olivier, Peter, Maxim, Alonso, Eloi, Logut, Adrien
Olivier Delalleau * 1, Maxim Peter *, Eloi Alonso, Adrien Logut Ubisoft La Forge Abstract While most current research in Reinforcement Learning (RL) focuses on improving the performance of the algorithms in controlled environments, the use of RL under constraints like those met in the video game industry is rarely studied. Operating under such constraints, we propose Hybrid SAC, an extension of the Soft Actor-Critic algorithm able to handle discrete, continuous and parameterized actions in a principled way. We show that Hybrid SAC can successfully solve a high-speed driving task in one of our games, and is competitive with the state-of-the-art on parameterized actions benchmark tasks. We also explore the impact of using normalizing flows to enrich the expressiveness of the policy at minimal computational cost, and identify a potential undesired effect of SAC when used with normalizing flows, that may be addressed by optimizing a different objective. Introduction Reinforcement Learning (RL) applications in video games have recently seen massive advances coming from the research community, with agents trained to play Atari games from pixels (Mnih et al. 2015) or to be competitive with the best players in the world in complicated imperfect information games like DOT A 2 (OpenAI 2018) or StarCraft II (Vinyals et al. 2019a; 2019b). These systems have comparatively seen little use within the video game industry, and we believe lack of accessibility to be a major reason behind this. Indeed, really impressive results like those cited above are produced by large research groups with computational resources well beyond what is typically available within video game studios. Our contributions are geared towards industry practitioners, by sharing experiments and practical advice for using RL with a different set of constraints than those met in the research community.
Towards Practical Multi-Object Manipulation using Relational Reinforcement Learning
Li, Richard, Jabri, Allan, Darrell, Trevor, Agrawal, Pulkit
Learning robotic manipulation tasks using reinforcement learning with sparse rewards is currently impractical due to the outrageous data requirements. Many practical tasks require manipulation of multiple objects, and the complexity of such tasks increases with the number of objects. Learning from a curriculum of increasingly complex tasks appears to be a natural solution, but unfortunately, does not work for many scenarios. We hypothesize that the inability of the state-of-the-art algorithms to effectively utilize a task curriculum stems from the absence of inductive biases for transferring knowledge from simpler to complex tasks. We show that graph-based relational architectures overcome this limitation and enable learning of complex tasks when provided with a simple curriculum of tasks with increasing numbers of objects. We demonstrate the utility of our framework on a simulated block stacking task. Starting from scratch, our agent learns to stack six blocks into a tower. Despite using step-wise sparse rewards, our method is orders of magnitude more data-efficient and outperforms the existing state-of-the-art method that utilizes human demonstrations. Furthermore, the learned policy exhibits zero-shot generalization, successfully stacking blocks into taller towers and previously unseen configurations such as pyramids, without any further training.
OpenAI Open Sources Safety Gym to Improve Safety in Reinforcement Learning Agents
Safety is one of the emerging concerns in deep learning systems. In the context of deep learning systems, safety is related to building agents that respect safety dynamics in a given environment. In many cases such as supervised learning, safety is modeled as part of the training datasets. However, other methods such as reinforcement learning require agents to master the dynamics of the environments by experimenting with it which introduces its own set of safety concerns. To address some of these challenges, OpenAI has recently open sourced Safety Gym, a suite of environments and tools for measuring progress towards reinforcement learning agents that respect safety constraints while training.
Finite-Time Analysis and Restarting Scheme for Linear Two-Time-Scale Stochastic Approximation
Motivated by their broad applications in reinforcement learning, we study the linear two-time-scale stochastic approximation, an iterative method using two different step sizes for finding the solutions of a system of two equations. Our main focus is to characterize the finite-time complexity of this method under time-varying step sizes and Markovian noise. In particular, we show that the mean square errors of the variables generated by the method converge to zero at a sublinear rate $\mathcal{O}(k^{2/3})$, where $k$ is the number of iterations. We then improve the performance of this method by considering the restarting scheme, where we restart the algorithm after a predetermined number of iterations. We show that using this restarting method the complexity of the algorithm under time-varying step sizes is as good as the one using constant step sizes, but still achieving an exact converge to the desired solution. Moreover, the restarting scheme also helps to prevent the step sizes from getting too small, which is useful for the practical implementation of the linear two-time-scale stochastic approximation.
Direct and indirect reinforcement learning
Guan, Yang, Li, Shengbo Eben, Duan, Jingliang, Li, Jie, Ren, Yangang, Cheng, Bo
Reinforcement learning (RL) algorithms have been successfully applied to a range of challenging sequential decision making and control tasks. In this paper, we classify RL into direct and indirect methods according to how they seek optimal policy of the Markov Decision Process (MDP) problem. The former solves optimal policy by directly maximizing an objective function using gradient descent method, in which the objective function is usually the expectation of accumulative future rewards. The latter indirectly finds the optimal policy by solving the Bellman equation, which is the sufficient and necessary condition from Bellman's principle of optimality. We take vanilla policy gradient and approximate policy iteration to study their internal relationship, and reveal that both direct and indirect methods can be unified in actor-critic architecture and are equivalent if we always choose stationary state distribution of current policy as initial state distribution of MDP. Finally, we classify the current mainstream RL algorithms and compare the differences between other criteria including value-based and policy-based, model-based and model-free.
Parameterized Indexed Value Function for Efficient Exploration in Reinforcement Learning
Tan, Tian, Xiong, Zhihan, Dwaracherla, Vikranth R.
It is well known that quantifying uncertainty in the action-value estimates is crucial for efficient exploration in reinforcement learning. Ensemble sampling offers a relatively computationally tractable way of doing this using randomized value functions. However, it still requires a huge amount of computational resources for complex problems. In this paper, we present an alternative, computationally efficient way to induce exploration using index sampling. We use an indexed value function to represent uncertainty in our action-value estimates. We first present an algorithm to learn parameterized indexed value function through a distributional version of temporal difference in a tabular setting and prove its regret bound. Then, in a computational point of view, we propose a dual-network architecture, Parameterized Indexed Networks (PINs), comprising one mean network and one uncertainty network to learn the indexed value function. Finally, we show the efficacy of PINs through computational experiments.
The 10 Algorithms Data Scientist must have to Know.
Let's say I am given an Excel sheet with data about various fruits and I have to tell which look like Apples. What I will do is ask a question "Which fruits are red and round?" and divide all fruits which answer yes and no to the question. Now, All Red and Round fruits might not be apples and all apples won't be red and round. So I will ask a question "Which fruits have red or yellow color hints on them? " on red and round fruits and will ask "Which fruits are green and round?" on not red and round fruits. Based on these questions I can tell with considerable accuracy which are apples. This cascade of questions is what a decision tree is. However, this is a decision tree based on my intuition.
Can Agents Learn by Analogy? An Inferable Model for PAC Reinforcement Learning
Model-based reinforcement learning algorithms make decisions by building and utilizing a model of the environment. However, none of the existing algorithms attempts to infer the dynamics of any state-action pair from known state-action pairs before meeting it for sufficient times. We propose a new model-based method called Greedy Inference Model (GIM) that infers the unknown dynamics from known dynamics based on the internal spectral properties of the environment. In other words, GIM can "learn by analogy". We further introduce a new exploration strategy which ensures that the agent rapidly and evenly visits unknown state-action pairs. GIM is much more computationally efficient than state-of-the-art model-based algorithms, as the number of dynamic programming operations is independent of the environment size. Lower sample complexity could also be achieved under mild conditions compared against methods without inferring. Experimental results demonstrate the effectiveness and efficiency of GIM in a variety of real-world tasks.
Online Reinforcement Learning of Optimal Threshold Policies for Markov Decision Processes
Roy, Arghyadip, Borkar, Vivek, Karandikar, Abhay, Chaporkar, Prasanna
Markov Decision Process (MDP) problems can be solved using Dynamic Programming (DP) methods which suffer from the curse of dimensionality and the curse of modeling. To overcome these issues, Reinforcement Learning (RL) methods are adopted in practice. In this paper, we aim to obtain the optimal admission control policy in a system where different classes of customers are present. Using DP techniques, we prove that it is optimal to admit the $i$ th class of customers only upto a threshold $\tau(i)$ which is a non-increasing function of $i$. Contrary to traditional RL algorithms which do not take into account the structural properties of the optimal policy while learning, we propose a structure-aware learning algorithm which exploits the threshold structure of the optimal policy. We prove the asymptotic convergence of the proposed algorithm to the optimal policy. Due to the reduction in the policy space, the structure-aware learning algorithm provides remarkable improvements in storage and computational complexities over classical RL algorithms. Simulation results also establish the gain in the convergence rate of the proposed algorithm over other RL algorithms. The techniques presented in the paper can be applied to any general MDP problem covering various applications such as inventory management, financial planning and communication networking.
Exploring TD error as a heuristic for $\sigma$ selection in Q($\sigma$, $\lambda$)
In the landscape of TD algorithms, the Q( σ,λ) algorithm is an algorithm with the ability to perform a multi-step backup in an online manner while also successfully unifying the concepts of sampling with using the expectation across all actions for a state. Selecting the value of σ can be based on characteristics of the current state rather than having a constant value or being time based. This project explores the viability of such a TD-error based scheme. Introduction While having different dimensions of generalizability in an algorithm can serve as a powerful tool, in most cases it comes with the associated burden of having to manually select values along these dimensions, commonly referred to as hyper-parameter selection. In case of learning algorithms, an ideal algorithm would be completely general, even to the point that they do not need a fixed set of hyper-parameters for which they perform optimally for a given problem. In the context of Q( σ,λ), the introduction of the σ parameter gives us flexibility in terms of adjusting the proportion of sampling and expectation we want in our updates. But at the same time, while σ does serve as a hyper-parameter, atypically a constant value of σ was found to not have the best performance by De Asis, Hernandez-Garcia, Holland and Sutton (2018). They used a Dynamic Decay σ scheme for n-step Q( σ) where they reduced the value of σ after every episode by a factor of 0.95.