Reinforcement Learning
Hierarchical Reinforcement Learning with Hindsight
Levy, Andrew, Platt, Robert, Saenko, Kate
Reinforcement Learning (RL) algorithms can suffer from poor sample efficiency when rewards are delayed and sparse. We introduce a solution that enables agents to learn temporally extended actions at multiple levels of abstraction in a sample efficient and automated fashion. Our approach combines universal value functions and hindsight learning, allowing agents to learn policies belonging to different time scales in parallel. We show that our method significantly accelerates learning in a variety of discrete and continuous tasks.
Multiple-Step Greedy Policies in Online and Approximate Reinforcement Learning
Efroni, Yonathan, Dalal, Gal, Scherrer, Bruno, Mannor, Shie
Multiple-step lookahead policies have demonstrated high empirical competence in Reinforcement Learning, via the use of Monte Carlo Tree Search or Model Predictive Control. In a recent work \cite{efroni2018beyond}, multiple-step greedy policies and their use in vanilla Policy Iteration algorithms were proposed and analyzed. In this work, we study multiple-step greedy algorithms in more practical setups. We begin by highlighting a counter-intuitive difficulty, arising with soft-policy updates: even in the absence of approximations, and contrary to the 1-step-greedy case, monotonic policy improvement is not guaranteed unless the update stepsize is sufficiently large. Taking particular care about this difficulty, we formulate and analyze online and approximate algorithms that use such a multi-step greedy operator.
Do deep reinforcement learning agents model intentions?
Matiisen, Tambet, Labash, Aqeel, Majoral, Daniel, Aru, Jaan, Vicente, Raul
Inferring other agents' mental states such as their knowledge, beliefs and intentions is thought to be essential for effective interactions with other agents. Recently, multiagent systems trained via deep reinforcement learning have been shown to succeed in solving different tasks, but it remains unclear how each agent modeled or represented other agents in their environment. In this work we test whether deep reinforcement learning agents explicitly represent other agents' intentions (their specific aims or goals) during a task in which the agents had to coordinate the covering of different spots in a 2D environment. In particular, we tracked over time the performance of a linear decoder trained to predict the final goal of all agents from the hidden state of each agent's neural network controller. We observed that the hidden layers of agents represented explicit information about other agents' goals, i.e. the target landmark they ended up covering. We also performed a series of experiments, in which some agents were replaced by others with fixed goals, to test the level of generalization of the trained agents. We noticed that during the training phase the agents developed a differential preference for each goal, which hindered generalization. To alleviate the above problem, we propose simple changes to the MADDPG training algorithm which leads to better generalization against unseen agents. We believe that training protocols promoting more active intention reading mechanisms, e.g. by preventing simple symmetry-breaking solutions, is a promising direction towards achieving a more robust generalization in different cooperative and competitive tasks.
[D] What are some non-trivial but achievable exercises in reinforcement learning? โข r/MachineLearning
It's a lot more involved, but the Easy21 assignment from David Silver's RL course is a very thorough intro to RL - it does require going through requisite material in the course though. This is in opposition to putting a cross-entropy loss on a CNN and training it on MNIST - courses/practicals will vary on how practically they deal with this task vs. how much they go into the underlying theory.
Rebalancing Dockless Bike Sharing Systems
Pan, Ling, Cai, Qingpeng, Fang, Zhixuan, Tang, Pingzhong, Huang, Longbo
Bike sharing provides an environment-friendly way for traveling and is booming worldwide. Yet, due to the high similarity of user travel patterns, the bike imbalance problem constantly occurs, especially for dockless bike sharing systems, causing significant impact on service quality and company revenue. Thus, it has become a critical task for bike sharing systems to resolve such imbalance efficiently. In this paper, we propose a novel deep reinforcement learning framework for incentivizing users to rebalance such sys- tems. We model this problem as a Markov decision process and take both spatial and temporal features into consideration. We develop a novel deep reinforcement learning algorithm called Hierarchical Reinforcement Pricing (HRP), which builds upon the Deep Deterministic Policy Gradient algorithm. Different from existing methods that often ignore spatial information and rely heavily on accurate prediction, HRP can capture both spatial and temporal dependencies using a divide-and-conquer structure with an embedded localized module. We conduct extensive experiments to evaluate HRP, based on a dataset from Mobike, a major Chinese dockless bike sharing company. Results show that HRP performs close to the 24-timeslot look-ahead optimization, and outperforms state-of-the-art methods in both service level and bike distribution. It also transfers well when applied to unseen areas.
Automata-Guided Hierarchical Reinforcement Learning for Skill Composition
Li, Xiao, Ma, Yao, Belta, Calin
Skills learned through (deep) reinforcement learning often generalizes poorly across domains and re-training is necessary when presented with a new task. We present a framework that combines techniques in \textit{formal methods} with \textit{reinforcement learning} (RL). The methods we provide allows for convenient specification of tasks with logical expressions, learns hierarchical policies (meta-controller and low-level controllers) with well-defined intrinsic rewards, and construct new skills from existing ones with little to no additional exploration. We evaluate the proposed methods in a simple grid world simulation as well as a more complicated kitchen environment in AI2Thor
Towards Optimally Decentralized Multi-Robot Collision Avoidance via Deep Reinforcement Learning
Long, Pinxin, Fan, Tingxiang, Liao, Xinyi, Liu, Wenxi, Zhang, Hao, Pan, Jia
Developing a safe and efficient collision avoidance policy for multiple robots is challenging in the decentralized scenarios where each robot generate its paths without observing other robots' states and intents. While other distributed multi-robot collision avoidance systems exist, they often require extracting agent-level features to plan a local collision-free action, which can be computationally prohibitive and not robust. More importantly, in practice the performance of these methods are much lower than their centralized counterparts. We present a decentralized sensor-level collision avoidance policy for multi-robot systems, which directly maps raw sensor measurements to an agent's steering commands in terms of movement velocity. As a first step toward reducing the performance gap between decentralized and centralized methods, we present a multi-scenario multi-stage training framework to find an optimal policy which is trained over a large number of robots on rich, complex environments simultaneously using a policy gradient based reinforcement learning algorithm. We validate the learned sensor-level collision avoidance policy in a variety of simulated scenarios with thorough performance evaluations and show that the final learned policy is able to find time efficient, collision-free paths for a large-scale robot system. We also demonstrate that the learned policy can be well generalized to new scenarios that do not appear in the entire training period, including navigating a heterogeneous group of robots and a large-scale scenario with 100 robots. Videos are available at https://sites.google.com/view/drlmaca
Learning Real-World Robot Policies by Dreaming
Piergiovanni, AJ, Wu, Alan, Ryoo, Michael S.
Learning to control robots directly based on images is a primary challenge in robotics. However, many existing reinforcement learning approaches require iteratively obtaining millions of samples to learn a policy which can take significant time. In this paper, we focus on the problem of learning real-world robot action policies solely based on a few random off-policy samples. We learn a realistic dreaming model that can emulate samples equivalent to a sequence of images from the actual environment, and make the agent learn action policies by interacting with the dreaming model rather than the real world. We experimentally confirm that our dreaming model can learn realistic policies that transfer to the real-world.
A Framework and Method for Online Inverse Reinforcement Learning
Arora, Saurabh, Doshi, Prashant, Banerjee, Bikramjit
Inverse reinforcement learning (IRL) is the problem of learning the preferences of an agent from the observations of its behavior on a task. While this problem has been well investigated, the related problem of {\em online} IRL---where the observations are incrementally accrued, yet the demands of the application often prohibit a full rerun of an IRL method---has received relatively less attention. We introduce the first formal framework for online IRL, called incremental IRL (I2RL), and a new method that advances maximum entropy IRL with hidden variables, to this setting. Our formal analysis shows that the new method has a monotonically improving performance with more demonstration data, as well as probabilistically bounded error, both under full and partial observability. Experiments in a simulated robotic application of penetrating a continuous patrol under occlusion shows the relatively improved performance and speed up of the new method and validates the utility of online IRL.
Safe Policy Learning from Observations
Sarafian, Elad, Tamar, Aviv, Kraus, Sarit
In this paper, we consider the problem of learning a policy by observing numerous non-expert agents. Our goal is to extract a policy that, with high-confidence, acts better than the average agents' performance. Such a setting is important for real-world problems where expert data is scarce but non-expert data can easily be obtained, e.g. by crowdsourcing. Our approach is to pose this problem as safe policy improvement in Reinforcement Learning. First, we evaluate an average behavior policy and approximate its value function. Then, we develop a stochastic policy improvement algorithm, termed Rerouted Behavior Improvement (RBI), that safely improves the average behavior. The primary advantages of RBI over current safe learning methods are its stability in the presence of value estimation errors and the elimination of a policy search process. We demonstrate these advantages in a Taxi grid-world domain and in four games from the Atari learning environment.