Reinforcement Learning
Decentralized Deep Reinforcement Learning for Network Level Traffic Signal Control
In this thesis, I propose a family of fully decentralized deep multi-agent reinforcement learning (MARL) algorithms to achieve high, real-time performance in network-level traffic signal control. In this approach, each intersection is modeled as an agent that plays a Markovian game against the other intersection nodes in a traffic signal network modeled as an undirected graph, to approach the optimal reduction in delay. Following Partially Observable Markov Decision Processes (POMDPs), there are three levels of communication schemes between adjacent learning agents: independent deep Q-learning (IDQL), shared states reinforcement learning (S2RL), and a shared states and rewards version of S2RL, called S2R2L. In these three variants of decentralized MARL schemes, each agent trains its local deep Q-network (DQN) separately, enhanced by convergence-guaranteed techniques like double DQN, prioritized experience replay, multi-step bootstrapping, etc. To test the performance of the three proposed MARL algorithms, a SUMO-based simulation platform is developed to mimic real-world traffic evolution. A 4x4 Manhattan-style grid network, fed with random traffic demand between permitted origin-destination (OD) pairs, is set up as the testbed, and two different vehicle arrival rates are generated for model training and testing. The experimental results show that S2R2L converges faster and reaches better converged performance than IDQL and S2RL during training. Moreover, all three MARL schemes show exceptional generalization abilities. Their testing results surpass the benchmark Max Pressure (MP) algorithm under the criteria of average vehicle delay, network-level queue length, and fuel consumption rate. Notably, S2R2L has the best testing performance, reducing traffic delay by 34.55% and queue length by 10.91% compared with MP.
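The abstract names double DQN and multi-step bootstrapping without giving formulas. As a rough illustration only, the sketch below combines the two in their standard textbook form: an n-step discounted return, bootstrapped with the target network's value at the action the online network prefers. The function name, argument layout, and toy numbers are assumptions for illustration, not the thesis's implementation.

```python
def n_step_double_dqn_target(rewards, gamma, q_online_next, q_target_next, done):
    """Compute a multi-step double-DQN regression target (illustrative sketch).

    rewards        -- the n rewards observed after the current state
    gamma          -- discount factor
    q_online_next  -- online-network Q-values at the state reached after n steps
    q_target_next  -- target-network Q-values at that same state
    done           -- True if the episode ended within the n steps
    """
    # Discounted sum of the n observed rewards.
    n_step_return = sum((gamma ** k) * r for k, r in enumerate(rewards))
    if done:
        return n_step_return
    # Double DQN: the online network selects the action,
    # the target network evaluates it.
    best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return n_step_return + (gamma ** len(rewards)) * q_target_next[best_action]


if __name__ == "__main__":
    # Hypothetical numbers: two rewards, two signal phases as actions.
    target = n_step_double_dqn_target(
        rewards=[1.0, 1.0],
        gamma=0.5,
        q_online_next=[0.0, 2.0],
        q_target_next=[10.0, 4.0],
        done=False,
    )
    print(target)
```

In a decentralized setup like the one described, each intersection agent would compute such a target from its own replay buffer. What goes into the state and reward (local only, or shared with neighbours) is what would distinguish IDQL, S2RL, and S2R2L.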
Hyperparameter Selection for Offline Reinforcement Learning
Paine, Tom Le, Paduraru, Cosmin, Michi, Andrea, Gulcehre, Caglar, Zolna, Konrad, Novikov, Alexander, Wang, Ziyu, de Freitas, Nando
Offline reinforcement learning (RL purely from logged data) is an important avenue for deploying RL techniques in real-world scenarios. However, existing hyperparameter selection methods for offline RL break the offline assumption by evaluating policies corresponding to each hyperparameter setting in the environment. This online execution is often infeasible and hence undermines the main aim of offline RL. Therefore, in this work, we focus on offline hyperparameter selection, i.e. methods for choosing the best policy from a set of many policies trained using different hyperparameters, given only logged data. Through large-scale empirical evaluation we show that: 1) offline RL algorithms are not robust to hyperparameter choices, 2) factors such as the offline RL algorithm and method for estimating Q values can have a big impact on hyperparameter selection, and 3) when we control those factors carefully, we can reliably rank policies across hyperparameter choices, and therefore choose policies which are close to the best policy in the set. Overall, our results present an optimistic view that offline hyperparameter selection is within reach, even in challenging tasks with pixel observations, high dimensional action spaces, and long horizons.
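To make the selection problem concrete, here is a minimal sketch of one way to rank candidate policies using only logged data: score each policy by its estimated Q value at logged initial states, then sort. The abstract does not specify the ranking statistic, so this averaging rule, the function `rank_policies`, and the toy policies and critics are all assumptions for illustration.

```python
from statistics import mean


def rank_policies(policies, q_estimates, initial_states):
    """Rank policies by mean estimated Q(s0, pi(s0)) over logged initial states.

    policies       -- dict: name -> function mapping state to action
    q_estimates    -- dict: name -> function mapping (state, action) to a value,
                      assumed to have been fitted offline for that policy
    initial_states -- initial states taken from the logged dataset
    """
    scores = {
        name: mean(q_estimates[name](s, policy(s)) for s in initial_states)
        for name, policy in policies.items()
    }
    # Highest estimated value first; no environment interaction is needed.
    ranking = sorted(scores, key=scores.get, reverse=True)
    return ranking, scores


if __name__ == "__main__":
    # Hypothetical policies trained with two different learning rates.
    policies = {"lr=1e-3": lambda s: 1, "lr=1e-4": lambda s: 0}
    q_estimates = {
        "lr=1e-3": lambda s, a: s + a,
        "lr=1e-4": lambda s, a: s + a,
    }
    ranking, scores = rank_policies(policies, q_estimates, [0, 1, 2])
    print(ranking, scores)
```

The quality of such a ranking depends entirely on how the Q estimates were obtained, which is one of the factors the abstract reports as having a large impact.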
Modulation of viability signals for self-regulatory control
Ovalle, Alvaro, Lucas, Simon M.
We revisit the role of instrumental value as a driver of adaptive behavior. In active inference, instrumental or extrinsic value is quantified by the information-theoretic surprisal of a set of observations, which measures the extent to which those observations conform to prior beliefs or preferences. That is, an agent is expected to seek the type of evidence that is consistent with its own model of the world. For reinforcement learning tasks, the distribution of preferences replaces the notion of reward. We explore a scenario in which the agent learns this distribution in a self-supervised manner. In particular, we highlight the distinction between observations induced by the environment and those pertaining more directly to the continuity of an agent in time. We evaluate our methodology in a dynamic environment with discrete time and actions, first with a surprisal-minimizing model-free agent (in the RL sense) and then expanding to the model-based case to minimize the expected free energy.
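As a small illustration of surprisal as a stand-in for reward, the sketch below computes -log p(o) under a preference distribution and picks the action whose predicted observations have the lowest expected surprisal. The preference distribution is fixed here for simplicity, whereas the abstract describes learning it. All names and numbers are assumptions, and this is not the authors' agent.

```python
import math


def surprisal(observation, preferences):
    """Return -log p(observation) under the preference distribution."""
    return -math.log(preferences[observation])


def expected_surprisal(predicted, preferences):
    """Average surprisal over an action's predicted observation distribution."""
    return sum(
        p * surprisal(o, preferences) for o, p in predicted.items() if p > 0
    )


def choose_action(predictions, preferences):
    """Pick the action whose predicted observations best match preferences."""
    return min(
        predictions,
        key=lambda a: expected_surprisal(predictions[a], preferences),
    )


if __name__ == "__main__":
    # Hypothetical preferences: the agent expects to observe itself as "safe".
    preferences = {"safe": 0.8, "harm": 0.2}
    predictions = {
        "stay": {"safe": 0.9, "harm": 0.1},
        "move": {"safe": 0.5, "harm": 0.5},
    }
    print(choose_action(predictions, preferences))
```

Observations that the agent considers likely under its preferences carry low surprisal, so minimizing surprisal plays the role that maximizing reward plays in standard RL.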
Discovering Reinforcement Learning Algorithms
Oh, Junhyuk, Hessel, Matteo, Czarnecki, Wojciech M., Xu, Zhongwen, van Hasselt, Hado, Singh, Satinder, Silver, David
Reinforcement learning (RL) algorithms update an agent's parameters according to one of several possible rules, discovered manually through years of research. Automating the discovery of update rules from data could lead to more efficient algorithms, or algorithms that are better adapted to specific environments. Although there have been prior attempts at addressing this significant scientific challenge, it remains an open question whether it is feasible to discover alternatives to fundamental concepts of RL such as value functions and temporal-difference learning. This paper introduces a new meta-learning approach that discovers an entire update rule which includes both 'what to predict' (e.g. value functions) and 'how to learn from it' (e.g. bootstrapping) by interacting with a set of environments. The output of this method is an RL algorithm that we call Learned Policy Gradient (LPG). Empirical results show that our method discovers its own alternative to the concept of value functions. Furthermore it discovers a bootstrapping mechanism to maintain and use its predictions. Surprisingly, when trained solely on toy environments, LPG generalises effectively to complex Atari games and achieves non-trivial performance. This shows the potential to discover general RL algorithms from data.
PAC Bounds for Imitation and Model-based Batch Learning of Contextual Markov Decision Processes
Nair, Yash, Doshi-Velez, Finale
We consider the problem of batch multi-task reinforcement learning with observed context descriptors, motivated by its application to personalized medical treatment. In particular, we study two general classes of learning algorithms: direct policy learning (DPL), an imitation-learning based approach which learns from expert trajectories, and model-based learning. First, we derive sample complexity bounds for DPL, and then show that model-based learning from expert actions can, even with a finite model class, be impossible. After relaxing the conditions under which the model-based approach is expected to learn by allowing for greater coverage of state-action space, we provide sample complexity bounds for model-based learning with finite model classes, showing that there exist model classes with sample complexity exponential in their statistical complexity. We then derive a sample complexity upper bound for model-based learning based on a measure of concentration of the data distribution. Our results give formal justification for imitation learning over model-based learning in this setting.
Hierarchical Deep Reinforcement Learning Approach for Multi-Objective Scheduling With Varying Queue Sizes
Birman, Yoni, Ido, Ziv, Katz, Gilad, Shabtai, Asaf
Multi-objective task scheduling (MOTS) is the scheduling of tasks while optimizing multiple and possibly conflicting objectives. A challenging extension of this problem occurs when every individual task is a multi-objective optimization problem by itself. While deep reinforcement learning (DRL) has been successfully applied to complex sequential problems, its application to the MOTS domain has been stymied by two challenges. The first challenge is the inability of the DRL algorithm to ensure that every item is processed identically regardless of its position in the queue. The second challenge is the need to manage large queues, which results in large neural architectures and long training times. In this study we present MERLIN, a robust, modular and near-optimal DRL-based approach for multi-objective task scheduling. MERLIN applies a hierarchical approach to the MOTS problem by creating one neural network for the processing of individual tasks and another for the scheduling of the overall queue. In addition to being smaller and having shorter training times, the resulting architecture ensures that an item is processed in the same manner regardless of its position in the queue. Additionally, we present a novel approach for efficiently applying DRL-based solutions to very large queues, and demonstrate how we effectively scale MERLIN to process queue sizes that are larger by orders of magnitude than those on which it was trained. Extensive evaluation on multiple queue sizes shows that MERLIN outperforms multiple well-known baselines by a large margin (>22%).
Python For Beginners Part-1
Udemy Coupon - Python For Beginners Part-1, Beginner to Expert Python. Start from the basics and go all the way to creating your own applications and games! Created by Suraj Nimbalkar. Description: Learn Python From Scratch. I've created thorough, extensive, but easy-to-follow content which you'll easily understand and absorb. The course starts with the basics, including Python fundamentals, programming, and user interaction. The curriculum is going to be very hands-on as we walk you from start to finish becoming a professional Python developer.
Collision Avoidance Robotics Via Meta-Learning (CARML)
Iyer, Abhiram, Mahadevan, Aravind
I. INTRODUCTION. Today, most deep reinforcement learning techniques require models to be trained on a large number of training samples. In contrast, Model-Agnostic Meta-Learning (MAML) proposed by Finn et al. ... Inspired by the work done by Andrychowicz et al. in [7], they modeled an LSTM as a meta-learner, which helped to train another neural network "learner" classifier using a few-shot framework. Unlike common deep learning optimizers such as Momentum, ADAM, and Adagrad, this method is able to train a model ...
Explanation Augmented Feedback in Human-in-the-Loop Reinforcement Learning
Guan, Lin, Verma, Mudit, Kambhampati, Subbarao
Human-in-the-loop Reinforcement Learning (HRL) aims to integrate human guidance with Reinforcement Learning (RL) algorithms to improve sample efficiency and performance. The usual human guidance in HRL is a binary evaluative "good" or "bad" signal for queried states and actions. However, this suffers from the problems of weak supervision and poor efficiency in leveraging human feedback. To address this, we present EXPAND (Explanation Augmented Feedback), which allows the human to give explanatory information as saliency maps in addition to the binary feedback. EXPAND employs a state perturbation approach based on the salient information in the state to augment the feedback, reducing the number of human feedback signals required. We evaluate this approach in two domains, Taxi and Atari-Pong. We demonstrate the effectiveness of our method on three metrics: environment sample efficiency, human feedback sample efficiency, and agent gaze. We show that our method outperforms our baselines. Finally, we present an ablation study to confirm our hypothesis that augmenting binary feedback with salient state information gives a boost in performance.
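One plausible reading of the perturbation idea is sketched below: regions the human did not mark as salient are perturbed with noise, the marked regions are left intact, and every perturbed copy inherits the original binary label. The abstract does not state which regions are perturbed or how, so the uniform-noise choice, the function `augment_feedback`, and its parameters are assumptions for illustration rather than the EXPAND implementation.

```python
import random


def augment_feedback(state, salient_mask, label, num_copies=4, noise=0.1, seed=0):
    """Create extra labeled states by perturbing only non-salient pixels.

    state        -- 2D list of pixel values
    salient_mask -- 2D list of booleans, True where the human marked saliency
    label        -- the human's binary feedback for this state and action
    """
    rng = random.Random(seed)
    augmented = []
    for _ in range(num_copies):
        perturbed = [
            [
                pixel if salient else pixel + rng.uniform(-noise, noise)
                for pixel, salient in zip(row, mask_row)
            ]
            for row, mask_row in zip(state, salient_mask)
        ]
        # Each perturbed copy reuses the single human label.
        augmented.append((perturbed, label))
    return augmented


if __name__ == "__main__":
    # Hypothetical 2x2 state with one salient pixel at the top left.
    state = [[1.0, 0.0], [0.0, 0.5]]
    mask = [[True, False], [False, False]]
    for perturbed, label in augment_feedback(state, mask, label="good"):
        print(label, perturbed)
```

Under this reading, one human response yields several training examples, which is how the number of required feedback signals could drop.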