This course is about Reinforcement Learning. The first step is to talk about the mathematic. This course is about Reinforcement Learning. The first step is to talk about the mathematical background: we can use a Markov Decision Process as a model for reinforcement learning. We can solve the problem 3 ways: value-iteration, policy-iteration and Q-learning.
Abstract-- We propose a hybrid approach aimed at improving thesample efficiency in goal-directed reinforcement learning. We do this via a two-step mechanism where firstly, we approximate a model from Model-Free reinforcement learning. Then, we leverage this approximate model along with a notion of reachability using Mean First Passage Times to perform Model-Based reinforcement learning. Built on such a novel observation, we design two new algorithms - Mean First Passage Time based Q-Learning (MFPT-Q) and Mean First Passage Time based DYNA (MFPT-DYNA), that have been fundamentally modified from the state-of-the-art reinforcement learning techniques. Preliminary results have shown that our hybrid approaches converge with much fewer iterations than their corresponding state-of-the-art counterparts and therefore requiring much fewer samples and much fewer training trials to converge. I. INTRODUCTION Reinforcement Learning (RL) has been successfully applied to numerous challenging problems for autonomous agents to behave intelligently in unstructured real-world environment. One interesting area of research in RL which motivates this work is goal-directed reinforcement learning problem (GDRLP)  . In GDRLP, the learning process takes place in two stages.
Generalization to unknown/uncertain environments of reinforcement learning algorithms is crucial for real-world applications. In this work, we explicitly consider uncertainty associated with the test environment through an uncertainty set. We formulate the Distributionally Robust Reinforcement Learning (DR-RL) objective that consists in maximizing performance against a worst-case policy in uncertainty set centered at the reference policy. Based on this objective, we derive computationally efficient policy improvement algorithm that benefits from Distributionally Robust Optimization (DRO) guarantees. Further, we propose an iterative procedure that increases stability of learning, called Distributionally Robust Policy Iteration. Combined with maximum entropy framework, we derive a distributionally robust variant of Soft Q-learning that enjoys efficient practical implementation and produces policies with robust behaviour at test time. Our formulation provides a unified view on a number of safe RL algorithms and recent empirical successes.
Niu, Sufeng (Clemson University) | Chen, Siheng (Uber Advanced Technologies Group ) | Guo, Hanyu (Clemson University) | Targonski, Colin (Clemson University) | Smith, Melissa C. (Clemson University) | Kovačević, Jelena (Carnegie Mellon University)
In this paper, we introduce a generalized value iteration network (GVIN), which is an end-to-end neural network planning module. GVIN emulates the value iteration algorithm by using a novel graph convolution operator, which enables GVIN to learn and plan on irregular spatial graphs. We propose three novel differentiable kernels as graph convolution operators and show that the embedding-based kernel achieves the best performance. Furthermore, we present episodic Q-learning, an improvement upon traditional n-step Q-learning that stabilizes training for VIN and GVIN. Lastly, we evaluate GVIN on planning problems in 2D mazes, irregular graphs, and real-world street networks, showing that GVIN generalizes well for both arbitrary graphs and unseen graphs of larger scaleand outperforms a naive generalization of VIN (discretizing a spatial graph into a 2D image).