Reinforcement Learning
A Framework for Reinforcement Learning and Planning
Moerland, Thomas M., Broekens, Joost, Jonker, Catholijn M.
Sequential decision making, commonly formalized as Markov Decision Process optimization, is a key challenge in artificial intelligence. Two successful approaches to MDP optimization are planning and reinforcement learning. Both research fields largely have their own research communities. However, if both research fields solve the same problem, then we should be able to disentangle the common factors in their solution approaches. Therefore, this paper presents a unifying framework for reinforcement learning and planning (FRAP), which identifies the underlying dimensions on which any planning or learning algorithm has to decide. At the end of the paper, we compare - in a single table - a variety of well-known planning, model-free and model-based RL algorithms along the dimensions of our framework, illustrating the validity of the framework. Altogether, FRAP provides deeper insight into the algorithmic space of planning and reinforcement learning, and also suggests new approaches to integration of both fields.
Model-based Reinforcement Learning: A Survey
Moerland, Thomas M., Broekens, Joost, Jonker, Catholijn M.
Sequential decision making, commonly formalized as Markov Decision Process (MDP) optimization, is a key challenge in artificial intelligence. Two key approaches to this problem are reinforcement learning (RL) and planning. This paper presents a survey of the integration of both fields, better known as model-based reinforcement learning. Model-based RL has two main steps. First, we systematically cover approaches to dynamics model learning, including challenges like dealing with stochasticity, uncertainty, partial observability, and temporal abstraction. Second, we present a systematic categorization of planning-learning integration, including aspects like: where to start planning, what budgets to allocate to planning and real data collection, how to plan, and how to integrate planning in the learning and acting loop. After these two key sections, we also discuss the potential benefits of model-based RL, like enhanced data efficiency, targeted exploration, and improved stability. Along the survey, we also draw connections to several related RL fields, like hierarchical RL and transfer, and other research disciplines, like behavioural psychology. Altogether, the survey presents a broad conceptual overview of planning-learning combinations for MDP optimization.
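The two steps named in this abstract (learn a dynamics model, then plan with it) can be illustrated with a minimal tabular sketch. It is not taken from the survey: the function names `learn_model` and `plan`, the count-based model, the uniform fallback for unvisited state-action pairs, and the toy two-state data are all assumptions chosen for illustration.

```python
import numpy as np

def learn_model(transitions, n_states, n_actions):
    """Step 1: estimate transition probabilities and mean rewards from (s, a, r, s') tuples."""
    counts = np.zeros((n_states, n_actions, n_states))
    reward_sum = np.zeros((n_states, n_actions))
    for s, a, r, s_next in transitions:
        counts[s, a, s_next] += 1
        reward_sum[s, a] += r
    visits = counts.sum(axis=2)
    safe_visits = np.maximum(visits, 1)  # avoid division by zero
    # Assumption: unvisited (s, a) pairs get a uniform next-state guess and zero reward.
    P = np.where(visits[:, :, None] > 0,
                 counts / safe_visits[:, :, None],
                 1.0 / n_states)
    R = reward_sum / safe_visits
    return P, R

def plan(P, R, gamma=0.9, n_iters=200):
    """Step 2: value iteration on the learned model; returns a greedy policy and state values."""
    V = np.zeros(P.shape[0])
    for _ in range(n_iters):
        Q = R + gamma * (P @ V)  # (S, A, S) @ (S,) -> (S, A)
        V = Q.max(axis=1)
    return Q.argmax(axis=1), V

# Hypothetical toy data: action 1 leads to state 1, and staying in state 1 pays reward 1.
transitions = [(0, 0, 0.0, 0), (0, 1, 0.0, 1), (1, 0, 0.0, 0), (1, 1, 1.0, 1)]
P, R = learn_model(transitions, n_states=2, n_actions=2)
policy, V = plan(P, R)
print(policy, V)
```

Real model-based RL methods replace both pieces with richer components (for example learned neural dynamics models and sample-based planners), and they interleave data collection with planning instead of running the two steps once.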
Invariant Policy Optimization: Towards Stronger Generalization in Reinforcement Learning
Sonar, Anoopkumar, Pacelli, Vincent, Majumdar, Anirudha
A fundamental challenge in reinforcement learning is to learn policies that generalize beyond the operating domains experienced during training. In this paper, we approach this challenge through the following invariance principle: an agent must find a representation such that there exists an action-predictor built on top of this representation that is simultaneously optimal across all training domains. Intuitively, the resulting invariant policy enhances generalization by finding causes of successful actions. We propose a novel learning algorithm, Invariant Policy Optimization (IPO), that implements this principle and learns an invariant policy during training. We compare our approach with standard policy gradient methods and demonstrate significant improvements in generalization performance on unseen domains for linear quadratic regulator and grid-world problems, and an example where a robot must learn to open doors with varying physical properties.
Modelling Hierarchical Structure between Dialogue Policy and Natural Language Generator with Option Framework for Task-oriented Dialogue System
Wang, Jianhong, Zhang, Yuan, Kim, Tae-Kyun, Gu, Yunjie
Designing task-oriented dialogue systems is a challenging research topic, since such a system needs not only to generate utterances that fulfil user requests but also to guarantee their comprehensibility. Many previous works trained end-to-end (E2E) models with supervised learning (SL); however, the bias in annotated system utterances remains a bottleneck. Reinforcement learning (RL) deals with this problem by using non-differentiable evaluation metrics (e.g., the success rate) as rewards. Nonetheless, existing works with RL showed that the comprehensibility of generated system utterances could be corrupted when improving the performance on fulfilling user requests. In our work, we (1) propose modelling the hierarchical structure between dialogue policy and natural language generator (NLG) with the option framework, called HDNO; (2) train HDNO with hierarchical reinforcement learning (HRL), and suggest alternating updates between dialogue policy and NLG during HRL, inspired by fictitious play, to preserve the comprehensibility of generated system utterances while improving the fulfilment of user requests; and (3) propose using a discriminator modelled with language models as an additional reward to further improve the comprehensibility. We test HDNO on MultiWoz 2.0 and MultiWoz 2.1, two datasets of multi-domain dialogues, in comparison with a word-level E2E model trained with RL, LaRL and HDSA, showing a significant improvement in the total performance evaluated with automatic metrics.
Bridging the Imitation Gap by Adaptive Insubordination
Weihs, Luca, Jain, Unnat, Salvador, Jordi, Lazebnik, Svetlana, Kembhavi, Aniruddha, Schwing, Alexander
Why do agents often obtain better reinforcement learning policies when imitating a worse expert? We show that privileged information used by the expert is marginalized in the learned agent policy, resulting in an "imitation gap." Prior work bridges this gap via a progression from imitation learning to reinforcement learning. While often successful, gradual progression fails for tasks that require frequent switches between exploration and memorization skills. To better address these tasks and alleviate the imitation gap we propose 'Adaptive Insubordination' (ADVISOR), which dynamically reweights imitation and reward-based reinforcement learning losses during training, enabling switching between imitation and exploration. On a suite of challenging tasks, we show that ADVISOR outperforms pure imitation, pure reinforcement learning, as well as sequential combinations of these approaches.
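The idea of reweighting imitation and RL losses per sample can be sketched as a convex combination. The abstract does not state the weighting rule, so the `agreement_weight` function below (an exponential of a KL divergence with a hypothetical temperature `alpha`) is purely an illustrative assumption and should not be read as the ADVISOR rule itself.

```python
import numpy as np

def agreement_weight(expert_probs, agent_probs, alpha=1.0, eps=1e-8):
    """Hypothetical rule: weight is 1 when two action distributions agree, and decays as they diverge."""
    p = np.clip(np.asarray(expert_probs, dtype=float), eps, 1.0)
    q = np.clip(np.asarray(agent_probs, dtype=float), eps, 1.0)
    kl = np.sum(p * np.log(p / q), axis=-1)  # KL(expert || agent) per sample
    return np.exp(-alpha * kl)

def combined_loss(imitation_loss, rl_loss, weight):
    """Per-sample convex combination of the two losses, averaged over the batch."""
    imitation_loss = np.asarray(imitation_loss, dtype=float)
    rl_loss = np.asarray(rl_loss, dtype=float)
    weight = np.clip(np.asarray(weight, dtype=float), 0.0, 1.0)
    return float(np.mean(weight * imitation_loss + (1.0 - weight) * rl_loss))

# Two samples: the distributions agree on the first and disagree strongly on the second.
w = agreement_weight([[0.5, 0.5], [0.99, 0.01]], [[0.5, 0.5], [0.01, 0.99]])
loss = combined_loss(imitation_loss=[2.0, 2.0], rl_loss=[4.0, 4.0], weight=w)
```

With a weight near 1 the sample is trained mostly by imitation, and with a weight near 0 mostly by the reward-based loss, which is what allows switching between the two during training.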
Explore More and Improve Regret in Linear Quadratic Regulators
Lale, Sahin, Azizzadenesheli, Kamyar, Hassibi, Babak, Anandkumar, Anima
Stabilizing the unknown dynamics of a control system and minimizing regret in control of an unknown system are among the main goals in control theory and reinforcement learning. In this work, we pursue both these goals for adaptive control of linear quadratic regulators (LQR). Prior works accomplish either one of these goals at the cost of the other one. The algorithms that are guaranteed to find a stabilizing controller suffer from high regret, whereas algorithms that focus on achieving low regret assume the presence of a stabilizing controller at the early stages of agent-environment interaction. In the absence of such a stabilizing controller, at the early stages, the lack of reasonable model estimates needed for (i) strategic exploration and (ii) design of controllers that stabilize the system, results in regret that scales exponentially in the problem dimensions. We propose a framework for adaptive control that exploits the characteristics of linear dynamical systems and deploys additional exploration in the early stages of agent-environment interaction to guarantee sooner design of stabilizing controllers. We show that for the classes of controllable and stabilizable LQRs, where the latter is a generalization of prior work, these methods achieve $\tilde{\mathcal{O}}(\sqrt{T})$ regret with a polynomial dependence in the problem dimensions.
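To make the notion of a stabilizing controller concrete, here is a minimal sketch for the simpler case where the system matrices are known; the paper itself addresses the harder setting of unknown dynamics. The helper names `lqr_gain` and `is_stable`, the fixed iteration count, and the scalar example system are assumptions for illustration only.

```python
import numpy as np

def lqr_gain(A, B, Q, R, n_iters=500):
    """Iterate the discrete-time Riccati recursion; returns the gain K (control u = -K x) and P."""
    P = Q.copy()
    for _ in range(n_iters):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
    return K, P

def is_stable(A, B, K):
    """The closed loop x_{t+1} = (A - B K) x_t is stable iff its spectral radius is below 1."""
    return bool(np.max(np.abs(np.linalg.eigvals(A - B @ K))) < 1.0)

# Hypothetical scalar system that is unstable without control (|1.1| > 1).
A = np.array([[1.1]])
B = np.array([[1.0]])
Q = np.eye(1)
R = np.eye(1)
K, P = lqr_gain(A, B, Q, R)
print(is_stable(A, B, np.zeros((1, 1))), is_stable(A, B, K))
```

In the adaptive setting, A and B must be estimated from data, so a gain computed this way from poor early estimates may fail to stabilize the true system. That is the difficulty the additional early exploration is meant to address.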
Challenging common bolus advisor for self-monitoring type-I diabetes patients using Reinforcement Learning
Logé, Frédéric, Pennec, Erwan Le, Amadou-Boubacar, Habiboulaye
Patients with diabetes who are self-monitoring have to decide right before each meal how much insulin they should take. A standard bolus advisor exists, but has never actually been proven to be optimal in any sense. We challenged this rule applying Reinforcement Learning techniques on data simulated with T1DM, an FDA-approved simulator developed by [3] modeling the gluco-insulin interaction. A lot of the research around blood glucose management for diabetes focuses on the artificial pancreas, so the case where the patient is equipped with an insulin pump. The interested reader can find an extensive review here [1]. For self-monitoring, [6] worked on the best delivery of insulin drugs to facilitate BG management. Based on a complex diabetes simulator, the authors of [2] and [7] worked on learning adaptively coefficients (CIR, CF)
Learning Infinite-horizon Average-reward MDPs with Linear Function Approximation
Wei, Chen-Yu, Jafarnia-Jahromi, Mehdi, Luo, Haipeng, Jain, Rahul
We develop several new algorithms for learning Markov Decision Processes in an infinite-horizon average-reward setting with linear function approximation. Using the optimism principle and assuming that the MDP has a linear structure, we first propose a computationally inefficient algorithm with optimal $\widetilde{O}(\sqrt{T})$ regret and another computationally efficient variant with $\widetilde{O}(T^{3/4})$ regret, where $T$ is the number of interactions. Next, taking inspiration from adversarial linear bandits, we develop yet another efficient algorithm with $\widetilde{O}(\sqrt{T})$ regret under a different set of assumptions, improving the best existing result by Hao et al. (2020) with $\widetilde{O}(T^{2/3})$ regret. Moreover, we draw a connection between this algorithm and the Natural Policy Gradient algorithm proposed by Kakade (2002), and show that our analysis improves the sample complexity bound recently given by Agarwal et al. (2020).
Behind DeepMind's Framework That Discovers New RL Algorithms
DeepMind recently introduced a new meta-learning approach that generates a reinforcement learning algorithm known as Learned Policy Gradient (LPG). According to the researchers, automating the discovery of update rules from data could lead to more efficient algorithms that are also better adapted to specific environments. Reinforcement learning is the machine learning technique that can be compared with the psychological behaviour of animals. Its objective is to maximise the expected cumulative reward or the average reward. The technique has gained much traction among researchers and developers over the past few years.
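The two objectives mentioned here, cumulative reward and average reward, can be computed for a single recorded episode as in the short sketch below. The function names and the discount factor `gamma` are assumptions for illustration; the article does not say which form LPG optimises.

```python
def discounted_return(rewards, gamma=0.99):
    """Cumulative reward where a reward received k steps in the future is scaled by gamma**k."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def average_reward(rewards):
    """Mean reward per step over the episode."""
    return sum(rewards) / len(rewards)

episode = [1.0, 1.0, 1.0]
print(discounted_return(episode, gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
print(average_reward(episode))                # 1.0
```

An agent maximises the expectation of such a quantity over the episodes its policy produces, not the value of any single episode.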
Reinforcement Learning Starts to Deliver on Its Promise
Summary: Advances in very low-cost compute and model-based reinforcement learning bring this modeling technique that much closer to adoption in the practical world. We keep asking if this is the year for reinforcement learning (RL) to finally make good on its many promises. Like flying cars and jet packs, the answer always seems to be at least a couple of years away. If your history with data science goes back to the late aughts, you may remember a time when there were only two basic types of models, supervised and unsupervised. Then, seemingly overnight, reinforcement learning was added as a third leg to this new stool.