Reinforcement Learning
Jelly Bean World: A Testbed for Never-Ending Learning
Platanios, Emmanouil Antonios, Saparov, Abulhair, Mitchell, Tom
Machine learning has shown growing success in recent years. However, current machine learning systems are highly specialized, trained for particular problems or domains, and typically on a single narrow dataset. Human learning, on the other hand, is highly general and adaptable. Never-ending learning is a machine learning paradigm that aims to bridge this gap, with the goal of encouraging researchers to design machine learning systems that can learn to perform a wider variety of inter-related tasks in more complex environments. To date, there is no environment or testbed to facilitate the development and evaluation of never-ending learning systems. To this end, we propose the Jelly Bean World testbed. The Jelly Bean World allows experimentation over two-dimensional grid worlds which are filled with items and in which agents can navigate. This testbed provides environments that are sufficiently complex and where more generally intelligent algorithms ought to perform better than current state-of-the-art reinforcement learning approaches. It does so by producing non-stationary environments and facilitating experimentation with multi-task, multi-agent, multi-modal, and curriculum learning settings. We hope that this new freely-available software will prompt new research and interest in the development and evaluation of never-ending learning systems and more broadly, general intelligence systems.
Deep RL Agent for a Real-Time Action Strategy Game
Warchalski, Michal, Radojevic, Dimitrije, Milosevic, Milos
We introduce a reinforcement learning environment based on Heroic - Magic Duel, a 1 v 1 action strategy game. This domain is non-trivial for several reasons: it is a real-time game, the state space is large, the information given to the player before and at each step of a match is imperfect, and the distribution of actions is dynamic. Our main contribution is a deep reinforcement learning agent that plays the game at a competitive level. We trained it using PPO and self-play with multiple competing agents, employing only a simple reward of $\pm 1$ depending on the outcome of a single match. Our best self-play agent obtains around a $65\%$ win rate against the existing AI and over a $50\%$ win rate against a top human player.
Reinforcement Learning Enhanced Quantum-inspired Algorithm for Combinatorial Optimization
Beloborodov, Dmitrii, Ulanov, A. E., Foerster, Jakob N., Whiteson, Shimon, Lvovsky, A. I.
Quantum hardware and quantum-inspired algorithms are becoming increasingly popular for combinatorial optimization. However, these algorithms may require careful hyperparameter tuning for each problem instance. We use a reinforcement learning agent in conjunction with a quantum-inspired algorithm to solve the Ising energy minimization problem, which is equivalent to the Maximum Cut problem. The agent controls the algorithm by tuning one of its parameters with the goal of improving recently seen solutions. We propose a new Rescaled Ranked Reward (R3) method that enables a stable single-player version of self-play training and helps the agent escape local optima. Training on any problem instance can be accelerated by applying transfer learning from an agent trained on randomly generated problems. Our approach allows sampling high-quality solutions to the Ising problem with high probability and outperforms both baseline heuristics and a black-box hyperparameter optimization approach.
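The equivalence between Ising energy minimization and Maximum Cut mentioned above can be checked by brute force on a small graph. The sketch below is illustrative only: the graph, the function names, and the sign convention E(s) = sum of w_ij * s_i * s_j are assumptions made here, not taken from the paper. Under that convention, cut(s) = (W - E(s)) / 2, where W is the total edge weight, so the lowest-energy spin assignment is also a maximum cut.

```python
import itertools

def ising_energy(weights, spins):
    """E(s) = sum over edges (i, j) of w_ij * s_i * s_j, spins in {-1, +1} (assumed convention)."""
    return sum(w * spins[i] * spins[j] for (i, j), w in weights.items())

def cut_value(weights, spins):
    """Total weight of edges whose endpoints carry opposite spins."""
    return sum(w for (i, j), w in weights.items() if spins[i] != spins[j])

# Hypothetical 4-node weighted graph; edges are keyed by (i, j) with i < j.
weights = {(0, 1): 1.0, (0, 2): 2.0, (1, 2): 1.0, (1, 3): 3.0, (2, 3): 1.0}
total_weight = sum(weights.values())
assignments = list(itertools.product((-1, 1), repeat=4))

# The identity cut(s) = (W - E(s)) / 2 holds for every spin assignment.
for spins in assignments:
    expected = (total_weight - ising_energy(weights, spins)) / 2
    assert abs(cut_value(weights, spins) - expected) < 1e-12

# Minimizing the energy and maximizing the cut therefore select equally good partitions.
best_by_energy = min(assignments, key=lambda s: ising_energy(weights, s))
best_by_cut = max(assignments, key=lambda s: cut_value(weights, s))
```

Exhaustive enumeration is only feasible for toy instances; it serves here to verify the identity, not as a solver.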
Explore, Discover and Learn: Unsupervised Discovery of State-Covering Skills
Campos, Víctor, Trott, Alexander, Xiong, Caiming, Socher, Richard, Giro-i-Nieto, Xavier, Torres, Jordi
Acquiring abilities in the absence of a task-oriented reward function is at the frontier of reinforcement learning research. This problem has been studied through the lens of empowerment, which draws a connection between option discovery and information theory. Information-theoretic skill discovery methods have garnered much interest from the community, but little research has been conducted in understanding their limitations. Through theoretical analysis and empirical evidence, we show that existing algorithms suffer from a common limitation -- they discover options that provide poor coverage of the state space. In light of this, we propose 'Explore, Discover and Learn' (EDL), an alternative approach to information-theoretic skill discovery. Crucially, EDL optimizes the same information-theoretic objective derived from the empowerment literature, but addresses the optimization problem using different machinery. We perform an extensive evaluation of skill discovery methods on controlled environments and show that EDL offers significant advantages, such as overcoming the coverage problem, reducing the dependence of learned skills on the initial state, and allowing the user to define a prior over which behaviors should be learned.
RL agents Implicitly Learning Human Preferences
In the real world, RL agents should be rewarded for fulfilling human preferences. We show that RL agents implicitly learn the preferences of humans in their environment. A classifier trained to predict whether a simulated human's preferences are fulfilled, based on the activations of an RL agent's neural network, achieves 0.93 AUC. A classifier trained on the raw environment state achieves only 0.8 AUC. Training the classifier on the RL agent's activations also does much better than training on activations from an autoencoder. The human preference classifier can be used as the reward function of an RL agent to make RL agents more beneficial for humans.
Extended Markov Games to Learn Multiple Tasks in Multi-Agent Reinforcement Learning
León, Borja G., Belardinelli, Francesco
This paper focuses on formally extending Markov Games (MGs), the mathematical model traditionally used in multi-agent RL (MARL), to build a new general model, i.e., one not focused solely on one kind of multi-agent game, that allows multiple learning agents to concurrently fulfill various non-Markovian specifications in multi-agent settings. To support our model with empirical evidence, we also extended two logic-based RL algorithms to multi-agent systems in order to show how various learning agents can fulfill different types of non-Markovian specifications expressed in co-safe Linear-time Temporal Logic (LTL). Our results are promising and point to [...]
[...] Learning (RL) has recently attracted interest as a way for single-agent RL to learn multiple-task specifications. In this paper we extend this convergence to multi-agent settings and formally define Extended Markov Games as a general mathematical model that allows multiple RL agents to concurrently learn various non-Markovian specifications. To introduce this new model we provide formal definitions and proofs as well as empirical tests of RL algorithms running on this framework. Specifically, we use our model to train two different logic-based multi-agent RL algorithms to solve diverse settings of non-Markovian co-safe LTL specifications.
Learning Functionally Decomposed Hierarchies for Continuous Control Tasks
Jendele, Lukas, Christen, Sammy, Aksan, Emre, Hilliges, Otmar
Solving long-horizon sequential decision making tasks in environments with sparse rewards is a longstanding problem in reinforcement learning (RL) research. Hierarchical Reinforcement Learning (HRL) has held the promise to enhance the capabilities of RL agents via operation on different levels of temporal abstraction. Despite the success of recent works in dealing with inherent nonstationarity and sample complexity, it remains difficult to generalize to unseen environments and to transfer different layers of the policy to other agents. In this paper, we propose a novel HRL architecture, Hierarchical Decompositional Reinforcement Learning (HiDe), which decomposes the hierarchical layers into independent subtasks yet still allows joint training of all layers in an end-to-end manner. The main insight is to combine a control policy on a lower level with an image-based planning policy on a higher level. We evaluate our method on various complex continuous control tasks, demonstrating that generalization across environments and transfer of higher-level policies, such as from a simple ball to a complex humanoid, can be achieved. See videos at https://sites.google.com/view/hide-rl.
Improving Generalization of Reinforcement Learning with Minimax Distributional Soft Actor-Critic
Ren, Yangang, Duan, Jingliang, Guan, Yang, Li, Shengbo Eben
Reinforcement learning (RL) has achieved remarkable performance in a variety of sequential decision making and control tasks. However, a common problem is that a learned, nearly optimal policy always overfits to the training environment and may not extend to situations never encountered during training. For practical applications, the randomness of the environment usually leads to rare but devastating events, which should be the focus of safety-critical systems such as autonomous driving. In this paper, we introduce the minimax formulation and the distributional framework to improve the generalization ability of RL algorithms and develop the Minimax Distributional Soft Actor-Critic (Minimax DSAC) algorithm. The minimax formulation seeks an optimal policy under the most severe disturbances from the environment: the protagonist policy maximizes the action-value function while the adversary policy tries to minimize it. The distributional framework learns a state-action return distribution, from which we can model the risk of different returns explicitly and thus formulate a risk-averse protagonist policy and a risk-seeking adversarial policy. We implement our method on the decision-making tasks of autonomous vehicles at intersections and test the trained policy in environments distinct from the training environment. Results demonstrate that our method can greatly improve the generalization ability of the protagonist agent to different environmental variations.
Frequency-based Search-control in Dyna
Pan, Yangchen, Mei, Jincheng, Farahmand, Amir-massoud
Model-based reinforcement learning has been empirically demonstrated as a successful strategy to improve sample efficiency. In particular, Dyna is an elegant model-based architecture that integrates learning and planning and provides great flexibility in how a model is used. One of the most important components in Dyna is called search-control, which refers to the process of generating states or state-action pairs from which we query the model to acquire simulated experiences. Search-control is critical in improving learning efficiency. In this work, we propose a simple and novel search-control strategy that searches high-frequency regions of the value function. Our main intuition is built on the Shannon sampling theorem from signal processing, which indicates that a high-frequency signal requires more samples to reconstruct. We empirically show that a high-frequency function is more difficult to approximate. This suggests a search-control strategy: we should use states from high-frequency regions of the value function to query the model to acquire more samples. We develop a simple strategy to locally measure the frequency of a function by gradient and Hessian norms, and provide theoretical justification for this approach. We then apply our strategy to search-control in Dyna, and conduct experiments to show its properties and effectiveness on benchmark domains.
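The idea of measuring local frequency through gradient and Hessian norms can be illustrated in one dimension. The sketch below is not the authors' method: the toy value function, the finite-difference estimates, and the score (squared first derivative plus squared second derivative) are all assumptions chosen for illustration. It ranks candidate states by that score and keeps the top few as search-control states.

```python
import math

def local_frequency_score(f, x, h=1e-3):
    """Hypothetical score: squared first derivative plus squared second derivative,
    both estimated with central finite differences of step h."""
    first = (f(x + h) - f(x - h)) / (2 * h)
    second = (f(x + h) - 2 * f(x) + f(x - h)) / (h * h)
    return first ** 2 + second ** 2

def value_fn(x):
    """Toy 1-D value function: smooth for x < 1, rapidly oscillating for x >= 1."""
    if x < 1.0:
        return math.sin(x)
    return math.sin(1.0) + 0.1 * math.sin(40.0 * (x - 1.0))

# Candidate states on [0, 2], offset so that none sits at the kink at x = 1.
candidates = [i / 10 + 0.05 for i in range(20)]

# Keep the five candidates with the highest score as search-control states.
selected = sorted(
    candidates,
    key=lambda x: local_frequency_score(value_fn, x),
    reverse=True,
)[:5]
```

In this toy setup every selected state falls in the oscillating region, which is where a function approximator would need more samples.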
XCS Classifier System with Experience Replay
Stein, Anthony, Maier, Roland, Rosenbauer, Lukas, Hähner, Jörg
XCS constitutes the most deeply investigated classifier system today. It bears strong potential and comes with inherent capabilities for mastering a variety of different learning tasks. Besides outstanding successes in various classification and regression tasks, XCS also proved very effective in certain multi-step environments from the domain of reinforcement learning. Especially in the latter domain, recent advances have been mainly driven by algorithms which model their policies based on deep neural networks -- among which the Deep-Q-Network (DQN) is a prominent representative. Experience Replay (ER) constitutes one of the crucial factors for the DQN's successes, since it facilitates stabilized training of the neural network-based Q-function approximators. Surprisingly, XCS barely takes advantage of similar mechanisms that leverage stored raw experiences encountered so far. To bridge this gap, this paper investigates the benefits of extending XCS with ER. On the one hand, we demonstrate that for single-step tasks ER bears massive potential for improvements in terms of sample efficiency. On the downside, however, we reveal that the use of ER might further aggravate well-studied issues not yet solved for XCS when applied to sequential decision problems that demand long action chains.
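For readers unfamiliar with Experience Replay, the following is a minimal generic sketch of a replay buffer of the kind commonly paired with DQN. It does not reproduce the XCS-specific integration studied in the paper; the class name, the transition layout, and the uniform sampling scheme are assumptions made for illustration.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of raw transitions with uniform random sampling."""

    def __init__(self, capacity, seed=None):
        # A deque with maxlen silently drops the oldest transition when full.
        self._storage = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self._storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Sample without replacement; never request more items than are stored.
        k = min(batch_size, len(self._storage))
        return self._rng.sample(list(self._storage), k)

    def __len__(self):
        return len(self._storage)

# Usage: store five transitions in a buffer of capacity three, then draw a batch.
buffer = ReplayBuffer(capacity=3, seed=0)
for step in range(5):
    buffer.add(state=step, action=0, reward=1.0, next_state=step + 1, done=False)
batch = buffer.sample(2)
```

A learner would draw such batches at each update step instead of learning only from the most recent transition, which is the source of the sample-efficiency gains discussed above.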