Reinforcement Learning
Learn Types of Machine Learning Algorithms with Ultimate Use Cases - DataFlair
In this article, we will study the various types of machine learning algorithms and their use cases. We will study how Baidu is using supervised learning-based facial recognition for intelligent airport check-in and how Google is making use of reinforcement learning to develop an intelligent platform that would answer your queries. Machine learning is a broad field, usually divided into three classes: supervised, unsupervised, and reinforcement learning. All three paradigms are widely used to power intelligent applications. We will look at the important use cases of these paradigms and how they are revolutionizing our world today.
Reinforcement Learning -- Policy Approximation
Until now, all of the algorithms introduced have been gradient methods based on a value function or a Q function. That is, we assume there exists a true value V (or Q) for each state S (or state-action pair [S, A]), and we approach that true value with a gradient method whose update formula contains V or Q. At the end of the learning process, a policy π(A | S) is generated by choosing the most rewarding action at each state according to the estimated V or Q function. The policy gradient method takes a completely different view of reinforcement learning problems: instead of learning a value function, one can learn and update a policy directly. Recall that in previous posts the policy used during learning was always ϵ-greedy, which means the agent takes a random action with a certain probability and the greedy action otherwise. In the policy gradient method, the problem is instead formulated as P(A | S, θ) = π(A | S, θ), which says that, for each state, the policy gives the probability of each action that can be taken from that state. In order to optimise the policy, it is parameterised with θ (similar to the weight parameter w in the value functions introduced before). Because the objective J is a function of the policy π, the update of θ involves the current policy, and after a series of derivations (for details, please refer to Sutton's book, chapter 13) we arrive at an update rule in which G is still the cumulative discounted reward and θ is updated with the derivative of the current policy.
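The update formula itself is not reproduced in the text above. As a rough illustration, the sketch below assumes the standard Monte Carlo policy-gradient (REINFORCE) form from Sutton and Barto's chapter 13, θ ← θ + α γ^t G ∇ln π(A | S, θ), applied to a tabular softmax policy. The function names, the dictionary layout of theta, and the step size are illustrative choices, not taken from the original post.

```python
import math


def softmax_policy(theta, state):
    """Return pi(. | state, theta) for a tabular softmax policy.

    theta[state][a] is the preference for action a in that state
    (a hypothetical layout chosen for this sketch).
    """
    prefs = theta[state]
    m = max(prefs)  # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_update(theta, episode, alpha=0.1, gamma=0.99):
    """Apply one REINFORCE pass over a finished episode, in place.

    episode is a list of (state, action, reward) tuples in time order.
    """
    # Discounted return G_t for every step, computed backwards.
    returns = []
    g = 0.0
    for _, _, reward in reversed(episode):
        g = reward + gamma * g
        returns.append(g)
    returns.reverse()

    for t, ((state, action, _), g_t) in enumerate(zip(episode, returns)):
        probs = softmax_policy(theta, state)
        for a in range(len(probs)):
            # d ln pi(action | state) / d preference[a] for a softmax:
            # 1 - pi(a) for the chosen action, -pi(a) otherwise.
            grad_log = (1.0 if a == action else 0.0) - probs[a]
            theta[state][a] += alpha * (gamma ** t) * g_t * grad_log
    return theta


# One state, two actions; action 1 was taken and earned reward 1.
theta = {0: [0.0, 0.0]}
reinforce_update(theta, [(0, 1, 1.0)])
print(softmax_policy(theta, 0))  # action 1 is now slightly more likely
```

Because the return G was positive, the preference for the chosen action rises and the other falls, which is the "update with the derivative of the current policy" described above.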
Flow: A Modular Learning Framework for Autonomy in Traffic
Wu, Cathy, Kreidieh, Aboudy, Parvate, Kanaad, Vinitsky, Eugene, Bayen, Alexandre M
The rapid development of autonomous vehicles (AVs) holds vast potential for transportation systems through improved safety, efficiency, and access to mobility. However, due to numerous technical, political, and human factors challenges, new methodologies are needed to design vehicles and transportation systems for these positive outcomes. This article tackles important technical challenges arising from the partial adoption of autonomy (hence termed mixed autonomy, to involve both AVs and human-driven vehicles): partial control, partial observation, complex multi-vehicle interactions, and the sheer variety of traffic settings represented by real-world networks. To enable the study of the full diversity of traffic settings, we first propose to decompose traffic control tasks into modules, which may be configured and composed to create new control tasks of interest. These modules include salient aspects of traffic control tasks: networks, actors, control laws, metrics, initialization, and additional dynamics. Second, we study the potential of model-free deep Reinforcement Learning (RL) methods to address the complexity of traffic dynamics. The resulting modular learning framework is called Flow. Using Flow, we create and study a variety of mixed-autonomy settings, including single-lane, multi-lane, and intersection traffic. In all cases, the learned control law exceeds human driving performance (measured by system-level velocity) by at least 40% with only 5-10% adoption of AVs. In the case of partially-observed single-lane traffic, we show that a low-parameter neural network control law can eliminate commonly observed stop-and-go traffic. In particular, the control laws surpass all known model-based controllers, achieving near-optimal performance across a wide spectrum of vehicle densities (even with a memoryless control law) and generalizing to out-of-distribution vehicle densities.
Quantile QT-Opt for Risk-Aware Vision-Based Robotic Grasping
Bodnar, Cristian, Li, Adrian, Hausman, Karol, Pastor, Peter, Kalakrishnan, Mrinal
The distributional perspective on reinforcement learning (RL) has given rise to a series of successful Q-learning algorithms, resulting in state-of-the-art performance in arcade game environments. However, it has not yet been analyzed how these findings from a discrete setting translate to complex practical applications characterized by noisy, high dimensional and continuous state-action spaces. In this work, we propose Quantile QT-Opt (Q2-Opt), a distributional variant of the recently introduced distributed Q-learning algorithm [11] for continuous domains, and examine its behaviour in a series of simulated and real vision-based robotic grasping tasks. The absence of an actor in Q2-Opt allows us to directly draw a parallel to the previous discrete experiments in the literature without the additional complexities induced by an actor-critic architecture. We demonstrate that Q2-Opt achieves a superior vision-based object grasping success rate, while also being more sample efficient. The distributional formulation also allows us to experiment with various risk-distortion metrics that give us an indication of how robots can concretely manage risk in practice using a Deep RL control policy. As an additional contribution, we perform experiments on offline datasets and compare them with the latest findings from discrete settings. Surprisingly, we find that there is a discrepancy between our results and the previous batch RL findings from the literature obtained on arcade game environments.
I. INTRODUCTION
The new distributional perspective on RL has produced a novel class of Deep Q-learning methods that learn a distribution over the state-action returns, instead of using the expectation given by the traditional value function.
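To make the idea of acting on a return distribution rather than its expectation concrete, here is a minimal sketch that ranks two actions from a handful of quantile estimates. Conditional value at risk (CVaR) is used only as a familiar example of a risk measure; the abstract does not say which risk-distortion metrics the authors use, and the quantile values and function names are invented for illustration.

```python
def expected_value(quantiles):
    """Risk-neutral score: the mean of equally weighted quantile estimates."""
    return sum(quantiles) / len(quantiles)


def cvar(quantiles, alpha):
    """Risk-averse score: mean of the worst alpha fraction of quantiles."""
    ordered = sorted(quantiles)
    k = max(1, int(round(alpha * len(ordered))))
    return sum(ordered[:k]) / k


# Hypothetical quantile estimates of the return for two candidate actions.
candidates = {
    "safe": [0.4, 0.5, 0.5, 0.6],
    "risky": [-1.0, 0.5, 1.0, 1.7],
}

risk_neutral = max(candidates, key=lambda a: expected_value(candidates[a]))
risk_averse = max(candidates, key=lambda a: cvar(candidates[a], alpha=0.25))
print(risk_neutral, risk_averse)
```

With these numbers the risk-neutral rule prefers the action with the higher mean, while the CVaR rule prefers the action whose worst outcomes are milder. A scalar Q-value cannot express that distinction.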
Augmenting learning using symmetry in a biologically-inspired domain
Mishra, Shruti, Abdolmaleki, Abbas, Guez, Arthur, Trochim, Piotr, Precup, Doina
Invariances to translation, rotation and other spatial transformations are a hallmark of the laws of motion, and have widespread use in the natural sciences to reduce the dimensionality of systems of equations. In supervised learning, such as in image classification tasks, rotation, translation and scale invariances are used to augment training datasets. In this work, we use data augmentation in a similar way, exploiting symmetry in the quadruped domain of the DeepMind control suite (Tassa et al. 2018) to add to the trajectories experienced by the actor in the actor-critic algorithm of Abdolmaleki et al. (2018). In a data-limited regime, the agent using a set of experiences augmented through symmetry is able to learn faster. Our approach can be used to inject knowledge of invariances in the domain and task to augment learning in robots, and more generally, to speed up learning in realistic robotics applications.
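The augmentation idea can be sketched on a toy example. The code below assumes a reflection symmetry that can be written as a permutation plus sign flips on the state and action vectors, and that the reward is unchanged by the reflection. The vectors, the maps, and the function names are hypothetical and are not the quadruped domain's actual layout or the authors' implementation.

```python
def mirror(vec, perm, signs):
    """Apply a reflection: reorder entries by perm, then flip signs."""
    return [signs[i] * vec[perm[i]] for i in range(len(vec))]


def augment(transitions, state_map, action_map):
    """Return the original transitions plus their mirrored copies.

    Each map is a (perm, signs) pair. The reward is assumed to be
    invariant under the symmetry, so it is copied unchanged.
    """
    mirrored = [
        (mirror(s, *state_map), mirror(a, *action_map), r,
         mirror(s_next, *state_map))
        for (s, a, r, s_next) in transitions
    ]
    return transitions + mirrored


# Toy setup: state = [x, y], reflection flips y;
# action = [left_torque, right_torque], reflection swaps the two.
state_map = ([0, 1], [1.0, -1.0])
action_map = ([1, 0], [1.0, 1.0])

batch = [([1.0, 2.0], [0.3, 0.7], 1.0, [1.5, 2.5])]
augmented = augment(batch, state_map, action_map)
print(len(augmented))  # twice as many transitions as were collected
```

Each collected transition yields one extra transition at no additional interaction cost, which is why such augmentation is attractive in a data-limited regime.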
Reinforcement Learning for Multi-Objective Optimization of Online Decisions in High-Dimensional Systems
Meisheri, Hardik, Baniwal, Vinita, Sultana, Nazneen N, Ravindran, Balaraman, Khadilkar, Harshad
This paper describes a purely data-driven solution to a class of sequential decision-making problems with a large number of concurrent online decisions, with applications to computing systems and operations research. We assume that while the micro-level behaviour of the system can be broadly captured by analytical expressions or simulation, the macro-level or emergent behaviour is complicated by non-linearity, constraints, and stochasticity. If we represent the set of concurrent decisions to be computed as a vector, each element of the vector is assumed to be a continuous variable, and the number of such elements is arbitrarily large and variable from one problem instance to another. We first formulate the decision-making problem as a canonical reinforcement learning (RL) problem, which can be solved using purely data-driven techniques. We modify a standard approach known as advantage actor critic (A2C) to ensure its suitability to the problem at hand, and compare its performance to that of baseline approaches on the specific instance of a multi-product inventory management task. The key modifications include a parallelised formulation of the decision-making task, and a training procedure that explicitly recognises the quantitative relationship between different decisions. We also present experimental results probing the learned policies, and their robustness to variations in the data.
Meta-Q-Learning
Fakoor, Rasool, Chaudhari, Pratik, Soatto, Stefano, Smola, Alexander J.
This paper introduces Meta-Q-Learning (MQL), a new off-policy algorithm for meta-Reinforcement Learning (meta-RL). MQL builds upon three simple ideas. First, we show that Q-learning is competitive with state of the art meta-RL algorithms if given access to a context variable that is a representation of the past trajectory. Second, using a multi-task objective to maximize the average reward across the training tasks is an effective method to meta-train RL policies. Third, past data from the meta-training replay buffer can be recycled to adapt the policy on a new task using off-policy updates. MQL draws upon ideas in propensity estimation to do so and thereby amplifies the amount of available data for adaptation. Experiments on standard continuous-control benchmarks suggest that MQL compares favorably with state of the art meta-RL algorithms.
RLCache: Automated Cache Management Using Reinforcement Learning
This study investigates the use of reinforcement learning to guide the decisions of a general-purpose cache manager. Cache managers directly impact the overall performance of computer systems. They govern which objects should be cached, how long they should be cached for, and which objects to evict from the cache when it is full. These three decisions impact both the cache hit rate and the size of the storage needed to achieve that hit rate. An optimal cache manager will avoid unnecessary operations, maximise the cache hit rate, which results in fewer round trips to a slower backend storage system, and minimise the size of storage needed to achieve a high hit rate. This project investigates using reinforcement learning in cache management by designing three separate agents, one for each of the cache manager tasks. Furthermore, the project investigates two advanced reinforcement learning architectures for multi-decision problems: a single multi-task agent and a multi-agent system. We also introduce a framework to simplify the modelling of computer systems problems as a reinforcement learning task. The framework abstracts delayed experience observations and reward assignment in computer systems while providing a flexible way to scale to multiple agents. Simulation results based on an established database benchmark system show that reinforcement learning agents can achieve a higher cache hit rate than heuristic-driven algorithms while minimising the needed space. They are also able to adapt to a changing workload and dynamically adjust their caching strategy accordingly. The proposed cache manager model is generic and applicable to other types of caches, such as file system caches. This project is the first, to our knowledge, to model cache manager decisions as a multi-task control problem.
Dynamic Interaction-Aware Scene Understanding for Reinforcement Learning in Autonomous Driving
Huegle, Maria, Kalweit, Gabriel, Werling, Moritz, Boedecker, Joschka
The common pipeline in autonomous driving systems is highly modular and includes a perception component which extracts lists of surrounding objects and passes these lists to a high-level decision component. In this case, leveraging the benefits of deep reinforcement learning for high-level decision making requires special architectures to deal with multiple variable-length sequences of different object types, such as vehicles, lanes or traffic signs. At the same time, the architecture has to be able to cover interactions between traffic participants in order to find the optimal action to be taken. In this work, we propose the novel Deep Scenes architecture, which can learn complex interaction-aware scene representations based on extensions of either 1) Deep Sets or 2) Graph Convolutional Networks. We present the Graph-Q and DeepScene-Q off-policy reinforcement learning algorithms, both outperforming state-of-the-art methods in evaluations with the publicly available traffic simulator SUMO.
I. INTRODUCTION
In autonomous driving scenarios, the number of traffic participants and lanes surrounding the agent can vary considerably over time. Common autonomous driving systems use modular pipelines, where a perception component extracts a list of surrounding objects and passes this list to other modules, including localization, mapping, motion planning and high-level decision making components.
Off-policy Multi-step Q-learning
Kalweit, Gabriel, Huegle, Maria, Boedecker, Joschka
In the past few years, off-policy reinforcement learning methods have shown promising results in their application for robot control. Deep Q-learning, however, still suffers from poor data-efficiency which is limiting with regard to real-world applications. We follow the idea of multi-step TD-learning to enhance data-efficiency while remaining off-policy by proposing two novel Temporal-Difference formulations: (1) Truncated Q-functions which represent the return for the first n steps of a policy rollout and (2) Shifted Q-functions, acting as the farsighted return after this truncated rollout. We prove that the combination of these short- and long-term predictions is a representation of the full return, leading to the Composite Q-learning algorithm. We show the efficacy of Composite Q-learning in the tabular case and compare our approach in the function-approximation setting with TD3, Model-based Value Expansion and TD3(Δ), which we introduce as an off-policy variant of TD(Δ). We show on three simulated robot tasks that Composite TD3 outperforms TD3 as well as state-of-the-art off-policy multi-step approaches in terms of data-efficiency.
In recent years, Q-learning (Watkins and Dayan, 1992) has achieved major successes in a broad range of areas by employing deep neural networks (Mnih et al., 2015; Silver et al., 2018; Lillicrap et al., 2016), including environments of higher complexity (Riedmiller et al., 2018) and even in first real world applications (Haarnoja et al., 2019). Due to its off-policy update, Q-learning can leverage transitions collected by any policy which makes it more data-efficient compared to on-policy methods.
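The claim that a truncated return and a shifted return together represent the full return can be checked numerically on a single finite reward sequence. The sketch below is only that arithmetic check, with the discount for the first n steps folded into the shifted part. It is not the Composite Q-learning algorithm, which learns these quantities as Q-functions in expectation, and the reward values are made up.

```python
def discounted_return(rewards, gamma, start=0, stop=None):
    """Sum gamma**k * rewards[k] for k in [start, stop)."""
    stop = len(rewards) if stop is None else stop
    return sum(gamma ** k * rewards[k] for k in range(start, stop))


rewards = [1.0, 0.0, 2.0, -1.0, 0.5, 3.0]  # one hypothetical rollout
gamma, n = 0.9, 3

full = discounted_return(rewards, gamma)                # all steps
truncated = discounted_return(rewards, gamma, stop=n)   # first n steps
shifted = discounted_return(rewards, gamma, start=n)    # everything after

# The short-term and long-term parts add up to the full return.
print(abs(full - (truncated + shifted)) < 1e-12)
```

Splitting the return this way lets the short-horizon part be learned from a few real steps while the long-horizon part is bootstrapped, which is the intuition behind combining the two predictions.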