Reinforcement Learning
Discovery of Options via Meta-Learned Subgoals
Veeriah, Vivek, Zahavy, Tom, Hessel, Matteo, Xu, Zhongwen, Oh, Junhyuk, Kemaev, Iurii, van Hasselt, Hado, Silver, David, Singh, Satinder
Temporal abstractions in the form of options have been shown to help reinforcement learning (RL) agents learn faster. However, despite prior work on this topic, the problem of discovering options through interaction with an environment remains a challenge. In this paper, we introduce a novel meta-gradient approach for discovering useful options in multi-task RL environments. Our approach is based on a manager-worker decomposition of the RL agent, in which a manager maximises rewards from the environment by learning a task-dependent policy over both a set of task-independent discovered-options and primitive actions. The option-reward and termination functions that define a subgoal for each option are parameterised as neural networks and trained via meta-gradients to maximise their usefulness. Empirical analysis on gridworld and DeepMind Lab tasks show that: (1) our approach can discover meaningful and diverse temporally-extended options in multi-task RL domains, (2) the discovered options are frequently used by the agent while learning to solve the training tasks, and (3) that the discovered options help a randomly initialised manager learn faster in completely new tasks.
Optimization Issues in KL-Constrained Approximate Policy Iteration
Lazić, Nevena, Hao, Botao, Abbasi-Yadkori, Yasin, Schuurmans, Dale, Szepesvári, Csaba
Many reinforcement learning algorithms can be seen as versions of approximate policy iteration (API). While standard API often performs poorly, it has been shown that learning can be stabilized by regularizing each policy update by the KL-divergence to the previous policy. Popular practical algorithms such as TRPO, MPO, and VMPO replace regularization by a constraint on KL-divergence of consecutive policies, arguing that this is easier to implement and tune. In this work, we study this implementation choice in more detail. We compare the use of KL divergence as a constraint vs. as a regularizer, and point out several optimization issues with the widely-used constrained approach. We show that the constrained algorithm is not guaranteed to converge even on simple problem instances where the constrained problem can be solved exactly, and in fact incurs linear expected regret. With approximate implementation using softmax policies, we show that regularization can improve the optimization landscape of the original objective. We demonstrate these issues empirically on several bandit and RL environments.
Feedback in Imitation Learning: The Three Regimes of Covariate Shift
Spencer, Jonathan, Choudhury, Sanjiban, Venkatraman, Arun, Ziebart, Brian, Bagnell, J. Andrew
Imitation learning practitioners have often noted that conditioning policies on previous actions leads to a dramatic divergence between "held out" error and performance of the learner in situ. Interactive approaches can provably address this divergence but require repeated querying of a demonstrator. Recent work identifies this divergence as stemming from a "causal confound" in predicting the current action, and seek to ablate causal aspects of current state using tools from causal inference. In this work, we argue instead that this divergence is simply another manifestation of covariate shift, exacerbated particularly by settings of feedback between decisions and input features. The learner often comes to rely on features that are strongly predictive of decisions, but are subject to strong covariate shift. Our work demonstrates a broad class of problems where this shift can be mitigated, both theoretically and practically, by taking advantage of a simulator but without any further querying of expert demonstration. We analyze existing benchmarks used to test imitation learning approaches and find that these benchmarks are realizable and simple and thus insufficient for capturing the harder regimes of error compounding seen in real-world decision making problems. We find, in a surprising contrast with previous literature, but consistent with our theory, that naive behavioral cloning provides excellent results. We detail the need for new standardized benchmarks that capture the phenomena seen in robotics problems.
Hedging of Financial Derivative Contracts via Monte Carlo Tree Search
The construction of approximate replication strategies for derivative contracts in incomplete markets is a key problem of financial engineering. Recently Reinforcement Learning algorithms for pricing and hedging under realistic market conditions have attracted significant interest. While financial research mostly focused on variations of Q-learning, in Artificial Intelligence Monte Carlo Tree Search is the recognized stateof-the-art method for various planning problems, such as the games of Hex, Chess, Go,... This article introduces Monte Carlo Tree Search as a method to solve the stochastic optimal control problem underlying the pricing and hedging of financial derivatives. As compared to Q-learning it combines reinforcement learning with tree search techniques. As a consequence Monte Carlo Tree Search has higher sample efficiency, is less prone to over-fitting to specific market models and generally learns stronger policies faster. In our experiments we find that Monte Carlo Tree Search, being the world-champion in games like Chess and Go, is easily capable of directly maximizing the utility of investor's terminal wealth without an intermediate mathematical theory.
Echo State Networks for Reinforcement Learning
Hart, Allen G., Olding, Kevin R., Cox, A. M. G., Isupova, Olga, Dawes, J. H. P.
Echo State Networks (ESNs) are a type of single-layer recurrent neural network with randomly-chosen internal weights and a trainable output layer. We prove under mild conditions that a sufficiently large Echo State Network (ESN) can approximate the value function of a broad class of stochastic and deterministic control problems. Such control problems are generally non-Markovian. We describe how the ESN can form the basis for novel (and computationally efficient) reinforcement learning algorithms in a non-Markovian framework. We demonstrate this theory with two examples. In the first, we use an ESN to solve a deterministic, partially observed, control problem which is a simple game we call `Bee World'. In the second example, we consider a stochastic control problem inspired by a market making problem in mathematical finance. In both cases we can compare the dynamics of the algorithms with analytic solutions to show that even after only a single reinforcement policy iteration the algorithms perform with reasonable skill.
Knowledge Infused Policy Gradients for Adaptive Pandemic Control
Roy, Kaushik, Zhang, Qi, Gaur, Manas, Sheth, Amit
COVID-19 has impacted nations differently based on their policy implementations. The effective policy requires taking into account public information and adaptability to new knowledge. Epidemiological models built to understand COVID-19 seldom provide the policymaker with the capability for adaptive pandemic control (APC). Among the core challenges to be overcome include (a) inability to handle a high degree of non-homogeneity in different contributing features across the pandemic timeline, (b) lack of an approach that enables adaptive incorporation of public health expert knowledge, and (c) transparent models that enable understanding of the decision-making process in suggesting policy. In this work, we take the early steps to address these challenges using Knowledge Infused Policy Gradient (KIPG) methods. Prior work on knowledge infusion does not handle soft and hard imposition of varying forms of knowledge in disease information and guidelines to necessarily comply with. Furthermore, the models do not attend to non-homogeneity in feature counts, manifesting as partial observability in informing the policy. Additionally, interpretable structures are extracted post-learning instead of learning an interpretable model required for APC. To this end, we introduce a mathematical framework for KIPG methods that can (a) induce relevant feature counts over multi-relational features of the world, (b) handle latent non-homogeneous counts as hidden variables that are linear combinations of kernelized aggregates over the features, and (b) infuse knowledge as functional constraints in a principled manner. The study establishes a theory for imposing hard and soft constraints and simulates it through experiments. In comparison with knowledge-intensive baselines, we show quick sample efficient adaptation to new knowledge and interpretability in the learned policy, especially in a pandemic context.
Deep Reinforcement Agent for Scheduling in HPC
Fan, Yuping, Lan, Zhiling, Childers, Taylor, Rich, Paul, Allcock, William, Papka, Michael E.
Cluster scheduler is crucial in high-performance computing (HPC). It determines when and which user jobs should be allocated to available system resources. Existing cluster scheduling heuristics are developed by human experts based on their experience with specific HPC systems and workloads. However, the increasing complexity of computing systems and the highly dynamic nature of application workloads have placed tremendous burden on manually designed and tuned scheduling heuristics. More aggressive optimization and automation are needed for cluster scheduling in HPC. In this work, we present an automated HPC scheduling agent named DRAS (Deep Reinforcement Agent for Scheduling) by leveraging deep reinforcement learning. DRAS is built on a novel, hierarchical neural network incorporating special HPC scheduling features such as resource reservation and backfilling. A unique training strategy is presented to enable DRAS to rapidly learn the target environment. Once being provided a specific scheduling objective given by system manager, DRAS automatically learns to improve its policy through interaction with the scheduling environment and dynamically adjusts its policy as workload changes. The experiments with different production workloads demonstrate that DRAS outperforms the existing heuristic and optimization approaches by up to 45%.
Multi-Task Reinforcement Learning with Context-based Representations
Sodhani, Shagun, Zhang, Amy, Pineau, Joelle
The benefit of multi-task learning over single-task learning relies on the ability to use relations across tasks to improve performance on any single task. While sharing representations is an important mechanism to share information across tasks, its success depends on how well the structure underlying the tasks is captured. In some real-world situations, we have access to metadata, or additional information about a task, that may not provide any new insight in the context of a single task setup alone but inform relations across multiple tasks. While this metadata can be useful for improving multi-task learning performance, effectively incorporating it can be an additional challenge. We posit that an efficient approach to knowledge transfer is through the use of multiple context-dependent, composable representations shared across a family of tasks. In this framework, metadata can help to learn interpretable representations and provide the context to inform which representations to compose and how to compose them. We use the proposed approach to obtain state-of-the-art results in Meta-World, a challenging multi-task benchmark consisting of 50 distinct robotic manipulation tasks.
Improving Model-Based Reinforcement Learning with Internal State Representations through Self-Supervision
Scholz, Julien, Weber, Cornelius, Hafez, Muhammad Burhan, Wermter, Stefan
Using a model of the environment, reinforcement learning agents can plan their future moves and achieve superhuman performance in board games like Chess, Shogi, and Go, while remaining relatively sample-efficient. As demonstrated by the MuZero Algorithm, the environment model can even be learned dynamically, generalizing the agent to many more tasks while at the same time achieving state-of-the-art performance. Notably, MuZero uses internal state representations derived from real environment states for its predictions. In this paper, we bind the model's predicted internal state representation to the environment state via two additional terms: a reconstruction model loss and a simpler consistency loss, both of which work independently and unsupervised, acting as constraints to stabilize the learning process. Our experiments show that this new integration of reconstruction model loss and simpler consistency loss provide a significant performance increase in OpenAI Gym environments. Our modifications also enable self-supervised pretraining for MuZero, so the algorithm can learn about environment dynamics before a goal is made available.
Learning Equational Theorem Proving
Piepenbrock, Jelle, Heskes, Tom, Janota, Mikoláš, Urban, Josef
We develop Stratified Shortest Solution Imitation Learning (3SIL) to learn equational theorem proving in a deep reinforcement learning (RL) setting. The self-trained models achieve state-of-the-art performance in proving problems generated by one of the top open conjectures in quasigroup theory, the Abelian Inner Mapping (AIM) Conjecture. To develop the methods, we first use two simpler arithmetic rewriting tasks that share tree-structured proof states and sparse rewards with the AIM problems. On these tasks, 3SIL is shown to significantly outperform several established RL and imitation learning methods. The final system is then evaluated in a standalone and cooperative mode on the AIM problems. The standalone 3SIL-trained system proves in 60 seconds more theorems (70.2%) than the complex, hand-engineered Waldmeister system (65.5%). In the cooperative mode, the final system is combined with the Prover9 system, proving in 2 seconds what standalone Prover9 proves in 60 seconds.