Reinforcement Learning
PettingZoo: Gym for Multi-Agent Reinforcement Learning
Terry, Justin K., Black, Benjamin, Jayakumar, Mario, Hari, Ananth, Santos, Luis, Dieffendahl, Clemens, Williams, Niall L., Lokesh, Yashas, Sullivan, Ryan, Horsch, Caroline, Ravi, Praveen
This paper introduces PettingZoo, a library of diverse sets of multi-agent environments under a single elegant Python API. PettingZoo was developed with the goal of accelerating research in multi-agent reinforcement learning, by creating a set of benchmark environments easily accessible to all researchers and a standardized API for the field. This goal is inspired by what OpenAI's Gym library did for accelerating research in single-agent reinforcement learning, and PettingZoo draws heavily from Gym in terms of API and user experience. PettingZoo is unique from other multi-agent environment libraries in that it's API is based on the model of Agent Environment Cycle ("AEC") games, which allows for the sensible representation of all varieties of games under one API for the first time. While retaining a very simple and Gym-like API, PettingZoo still allows access to low-level environment properties required by nontraditional learning methods. Reinforcement Learning ("RL") considers learning a policy -- a function that takes in an observation from an environment and emits an action -- that achieves the maximum expected discounted reward when acting in an environment, and it's capabilities have been one of the great success of modern machine learning. Multi-Agent Reinforcement Learning (MARL) in particular has been behind many of the most publicized achievements of modern machine learning -- AlphaGo Zero (Silver et al., 2017), OpenAI Five (OpenAI, 2018), AlphaStar (Vinyals et al., 2019) -- and has seen a boom in recent years.
Munchausen Reinforcement Learning
Vieillard, Nino, Pietquin, Olivier, Geist, Matthieu
Bootstrapping is a core mechanism in Reinforcement Learning (RL). Most algorithms, based on temporal differences, replace the true value of a transiting state by their current estimate of this value. Yet, another estimate could be leveraged to bootstrap RL: the current policy. Our core contribution stands in a very simple idea: adding the scaled log-policy to the immediate reward. We show that slightly modifying Deep Q-Network (DQN) in that way provides an agent that is competitive with distributional methods on Atari games, without making use of distributional RL, n-step returns or prioritized replay. To demonstrate the versatility of this idea, we also use it together with an Implicit Quantile Network (IQN). The resulting agent outperforms Rainbow on Atari, installing a new State of the Art with very little modifications to the original algorithm. To add to this empirical study, we provide strong theoretical insights on what happens under the hood -- implicit Kullback-Leibler regularization and increase of the action-gap.
Diversity-Enriched Option-Critic
Temporal abstraction allows reinforcement learning agents to represent knowledge and develop strategies over different temporal scales. The option-critic framework has been demonstrated to learn temporally extended actions, represented as options, end-to-end in a model-free setting. However, feasibility of option-critic remains limited due to two major challenges, multiple options adopting very similar behavior, or a shrinking set of task relevant options. These occurrences not only void the need for temporal abstraction, they also affect performance. In this paper, we tackle these problems by learning a diverse set of options. We introduce an information-theoretic intrinsic reward, which augments the task reward, as well as a novel termination objective, in order to encourage behavioral diversity in the option set. We show empirically that our proposed method is capable of learning options end-to-end on several discrete and continuous control tasks, outperforms option-critic by a wide margin. Furthermore, we show that our approach sustainably generates robust, reusable, reliable and interpretable options, in contrast to option-critic.
Generative Inverse Deep Reinforcement Learning for Online Recommendation
Chen, Xiaocong, Yao, Lina, Sun, Aixin, Wang, Xianzhi, Xu, Xiwei, Zhu, Liming
Deep reinforcement learning enables an agent to capture user's interest through interactions with the environment dynamically. It has attracted great interest in the recommendation research. Deep reinforcement learning uses a reward function to learn user's interest and to control the learning process. However, most reward functions are manually designed; they are either unrealistic or imprecise to reflect the high variety, dimensionality, and non-linearity properties of the recommendation problem. That makes it difficult for the agent to learn an optimal policy to generate the most satisfactory recommendations. To address the above issue, we propose a novel generative inverse reinforcement learning approach, namely InvRec, which extracts the reward function from user's behaviors automatically, for online recommendation. We conduct experiments on an online platform, VirtualTB, and compare with several state-of-the-art methods to demonstrate the feasibility and effectiveness of our proposed approach.
Animal Cognition Induces Common Sense in Artificial Intelligence Agents
Reinforcement learning models are trained, using a similar concept by animal researchers to train animals. For a very long period, artificial intelligence agents were trained on machine learning models to perform tasks that are usually done by humans. The neural networks of machine learning models are designed and trained in such a format that they perform the tasks without any human intervention or supervision. However, ever since its inception, the researchers and scientists are curious to induce cognitive abilities into artificial intelligence agents. For a decade, despite the experiments designed to train the artificial neural network by utilizing the human cognitive ability for adopting common sense, the researchers were unable to reach into a reasonable conclusion. The researchers were resorting to behavioral science and neuroscience earlier to induce common sense into the artificial intelligence agents.
Amortized Variational Deep Q Network
Zhang, Haotian, Wang, Yuhao, Sun, Jianyong, Xu, Zongben
Efficient exploration is one of the most important issues in deep reinforcement learning. To address this issue, recent methods consider the value function parameters as random variables, and resort variational inference to approximate the posterior of the parameters. In this paper, we propose an amortized variational inference framework to approximate the posterior distribution of the action value function in Deep Q Network. We establish the equivalence between the loss of the new model and the amortized variational inference loss. We realize the balance of exploration and exploitation by assuming the posterior as Cauchy and Gaussian, respectively in a two-stage training process. We show that the amortized framework can results in significant less learning parameters than existing state-of-the-art method. Experimental results on classical control tasks in OpenAI Gym and chain Markov Decision Process tasks show that the proposed method performs significantly better than state-of-art methods and requires much less training time.
Specialization in Hierarchical Learning Systems
Hihn, Heinke, Braun, Daniel A.
Joining multiple decision-makers together is a powerful way to obtain more sophisticated decision-making systems, but requires to address the questions of division of labor and specialization. We investigate in how far information constraints in hierarchies of experts not only provide a principled method for regularization but also to enforce specialization. In particular, we devise an information-theoretically motivated on-line learning rule that allows partitioning of the problem space into multiple sub-problems that can be solved by the individual experts. We demonstrate two different ways to apply our method: (i) partitioning problems based on individual data samples and (ii) based on sets of data samples representing tasks. Approach (i) equips the system with the ability to solve complex decision-making problems by finding an optimal combination of local expert decision-makers. Approach (ii) leads to decision-makers specialized in solving families of tasks, which equips the system with the ability to solve meta-learning problems. We show the broad applicability of our approach on a range of problems including classification, regression, density estimation, and reinforcement learning problems, both in the standard machine learning setup and in a meta-learning setting.
Efficient Evaluation of Natural Stochastic Policies in Offline Reinforcement Learning
Kallus, Nathan, Uehara, Masatoshi
We study the efficient off-policy evaluation of natural stochastic policies, which are defined in terms of deviations from the behavior policy. This is a departure from the literature on off-policy evaluation where most work consider the evaluation of explicitly specified policies. Crucially, offline reinforcement learning with natural stochastic policies can help alleviate issues of weak overlap, lead to policies that build upon current practice, and improve policies' implementability in practice. Compared with the classic case of a pre-specified evaluation policy, when evaluating natural stochastic policies, the efficiency bound, which measures the best-achievable estimation error, is inflated since the evaluation policy itself is unknown. In this paper, we derive the efficiency bounds of two major types of natural stochastic policies: tilting policies and modified treatment policies. We then propose efficient nonparametric estimators that attain the efficiency bounds under very lax conditions. These also enjoy a (partial) double robustness property.
MBVI: Model-Based Value Initialization for Reinforcement Learning
Lyu, Xubo, Li, Site, Siriya, Seth, Pu, Ye, Chen, Mo
Model-free reinforcement learning (RL) is capable of learning control policies for high-dimensional, complex robotic tasks, but tends to be data inefficient. Model-based RL and optimal control have been proven to be much more data-efficient if an accurate model of the system and environment is known, but can be difficult to scale to expressive models for high-dimensional problems. In this paper, we propose a novel approach to alleviate data inefficiency of model-free RL by warm-starting the learning process using model-based solutions. We do so by initializing a high-dimensional value function via supervision from a low-dimensional value function obtained by applying model-based techniques on a low-dimensional problem featuring an approximate system model. Therefore, our approach exploits the model priors from a simplified problem space implicitly and avoids the direct use of high-dimensional, expressive models. We demonstrate our approach on two representative robotic learning tasks and observe significant improvements in performance and efficiency, and analyze our method empirically with a third task.
Deep Reinforcement Learning Based Dynamic Route Planning for Minimizing Travel Time
Geng, Yuanzhe, Liu, Erwu, Wang, Rui, Liu, Yiming
Route planning is important in transportation. Existing works focus on finding the shortest path solution or using metrics such as safety and energy consumption to determine the planning. It is noted that most of these studies rely on prior knowledge of road network, which may be not available in certain situations. In this paper, we design a route planning algorithm based on deep reinforcement learning (DRL) for pedestrians. We use travel time consumption as the metric, and plan the route by predicting pedestrian flow in the road network. We put an agent, which is an intelligent robot, on a virtual map. Different from previous studies, our approach assumes that the agent does not need any prior information about road network, but simply relies on the interaction with the environment. We propose a dynamically adjustable route planning (DARP) algorithm, where the agent learns strategies through a dueling deep Q network to avoid congested roads. Simulation results show that the DARP algorithm saves 52% of the time under congestion condition when compared with traditional shortest path planning algorithms.