Reinforcement Learning
Autonomous Driving using Safe Reinforcement Learning by Incorporating a Regret-based Human Lane-Changing Decision Model
Chen, Dong, Jiang, Longsheng, Wang, Yue, Li, Zhaojian
It is expected that many human drivers will still prefer to drive themselves even when self-driving technologies are ready. To enable autonomous vehicles (AVs) to maneuver safely and efficiently in this mixed traffic, it is critical that the AVs understand how humans cope with risks and make driving-related decisions. On the other hand, the driving environment is highly dynamic and ever-changing, so it is difficult to enumerate all the scenarios and hard-code the controllers. To address these challenges, in this work we incorporate a human decision-making model in reinforcement learning to control AVs for safe and efficient operations. Specifically, we adapt regret theory to describe a human driver's lane-changing behavior, and fit personalized models to individual drivers to predict their lane-changing decisions. The predicted decisions are incorporated in the safety constraints for reinforcement learning, both in training and in implementation. We then use an extended version of double deep Q-network (DDQN) to train our AV controller within the safety set. By doing so, the number of collisions in training is reduced to zero, while the training accuracy is not compromised. I. INTRODUCTION Autonomous driving has attracted significant research interest in the past two decades as it offers the potential to release drivers from exhausting driving. While great progress has been made in the fields of perception, path planning, and control, high-level decision-making remains a big challenge due to complex, cluttered environments and the dynamic, uncertain behaviors of other traffic users. Some recent works have applied reinforcement learning (RL) methods to autonomous driving, and promising performance [1] has been reported. RL-based methods can learn decision-making and driving behaviors that are hard, if not infeasible, for traditional rule-based designs to capture, and often with much less human effort.
However, it is reported in [2] that, when using RL-based methods, many collisions occur before the agent starts to behave properly.
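The idea of training a DDQN controller "within the safety set" can be illustrated with a minimal sketch. It assumes the safety constraint is available as a per-action boolean mask; the names `masked_argmax`, `double_dqn_target`, and `safe_mask_next` are hypothetical, and the excerpt does not say how the paper's extended DDQN or its regret-based predictions are actually implemented. The sketch only shows the standard double-DQN target with action selection restricted to safe actions.

```python
def masked_argmax(q_values, safe_mask):
    """Return the index of the highest-valued action among those marked safe."""
    safe_actions = [a for a, ok in enumerate(safe_mask) if ok]
    if not safe_actions:
        raise ValueError("no safe action available in this state")
    return max(safe_actions, key=lambda a: q_values[a])


def double_dqn_target(reward, done, q_online_next, q_target_next,
                      safe_mask_next, gamma=0.99):
    """Double-DQN target restricted to a safe action set (illustrative).

    The online network's values select the next action (only among safe
    actions), and the target network's values evaluate that action.
    safe_mask_next is an assumed stand-in for the safety constraint.
    """
    if done:
        return reward
    best = masked_argmax(q_online_next, safe_mask_next)
    return reward + gamma * q_target_next[best]


# Example: action 1 has the highest online value but is marked unsafe,
# so action 2 is selected and evaluated with the target network.
target = double_dqn_target(
    reward=1.0, done=False,
    q_online_next=[1.0, 5.0, 2.0],
    q_target_next=[0.5, 9.0, 0.25],
    safe_mask_next=[True, False, True],
    gamma=0.5,
)
```

Applying the same mask when the agent picks its exploratory actions would keep unsafe actions from ever being executed, which is one plausible reading of how collisions during training could be avoided.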
Efficient Intrinsically Motivated Robotic Grasping with Learning-Adaptive Imagination in Latent Space
Hafez, Muhammad Burhan, Weber, Cornelius, Kerzel, Matthias, Wermter, Stefan
Combining model-based and model-free deep reinforcement learning has shown great promise for improving sample efficiency on complex control tasks while still retaining high performance. Incorporating imagination is a recent effort in this direction inspired by human mental simulation of motor behavior. We propose a learning-adaptive imagination approach which, unlike previous approaches, takes into account the reliability of the learned dynamics model used for imagining the future. Our approach learns an ensemble of disjoint local dynamics models in latent space and derives an intrinsic reward based on learning progress, motivating the controller to take actions leading to data that improves the models. The learned models are used to generate imagined experiences, augmenting the training set of real experiences. We evaluate our approach on learning vision-based robotic grasping and show that it significantly improves sample efficiency and achieves near-optimal performance in a sparse reward environment.
RLCard: A Toolkit for Reinforcement Learning in Card Games
Zha, Daochen, Lai, Kwei-Herng, Cao, Yuanpu, Huang, Songyi, Wei, Ruzhe, Guo, Junyu, Hu, Xia
RLCard is an open-source toolkit for reinforcement learning research in card games. It supports various card environments with easy-to-use interfaces, including Blackjack, Leduc Hold'em, Texas Hold'em, UNO, Dou Dizhu and Mahjong. The goal of RLCard is to bridge reinforcement learning and imperfect information games, and push forward the research of reinforcement learning in domains with multiple agents, large state and action space, and sparse reward. In this paper, we provide an overview of the key components in RLCard, a discussion of the design principles, a brief introduction of the interfaces, and comprehensive evaluations of the environments.
Asking Easy Questions: A User-Friendly Approach to Active Reward Learning
Bıyık, Erdem, Palan, Malayandi, Landolfi, Nicholas C., Losey, Dylan P., Sadigh, Dorsa
Robots can learn the right reward function by querying a human expert. Existing approaches attempt to choose questions where the robot is most uncertain about the human's response; however, they do not consider how easy it will be for the human to answer! In this paper we explore an information gain formulation for optimally selecting questions that naturally account for the human's ability to answer. Our approach identifies questions that optimize the trade-off between robot and human uncertainty, and determines when these questions become redundant or costly. Simulations and a user study show our method not only produces easy questions, but also ultimately results in faster reward learning.
Towards Simplicity in Deep Reinforcement Learning: Streamlined Off-Policy Learning
Wang, Che, Wu, Yanqiu, Vuong, Quan, Ross, Keith
The field of Deep Reinforcement Learning (DRL) has recently seen a surge in the popularity of maximum entropy reinforcement learning algorithms. Their popularity stems from the intuitive interpretation of the maximum entropy objective and their superior sample efficiency on standard benchmarks. In this paper, we seek to understand the primary contribution of the entropy term to the performance of maximum entropy algorithms. For the Mujoco benchmark, we demonstrate that the entropy term in Soft Actor Critic (SAC) principally addresses the bounded nature of the action spaces. With this insight, we propose a simple normalization scheme which allows a streamlined algorithm without entropy maximization to match the performance of SAC. Our experimental results demonstrate a need to revisit the benefits of entropy regularization in DRL. We also propose a simple nonuniform sampling method for selecting transitions from the replay buffer during training. We further show that the streamlined algorithm with the simple nonuniform sampling scheme outperforms SAC and achieves state-of-the-art performance on challenging continuous control tasks. 1 INTRODUCTION Off-policy deep Reinforcement Learning (RL) algorithms aim to improve sample efficiency by reusing past experience. Recently a number of new off-policy Deep Reinforcement Learning algorithms have been proposed for control tasks with continuous state and action spaces, including Deep Deterministic Policy Gradient (DDPG) and Twin Delayed DDPG (TD3) (Lillicrap et al., 2015; Fujimoto et al., 2018). TD3, in particular, has been shown to be significantly more sample efficient than popular on-policy methods for a wide range of Mujoco benchmarks. The field of Deep Reinforcement Learning (DRL) has also recently seen a surge in the popularity of maximum entropy reinforcement learning algorithms.
Their popularity stems from the intuitive interpretation of the maximum entropy objective and their superior sample efficiency on standard benchmarks.
Was it Worth Studying a Data Science Masters? - KDnuggets
Since completing my Masters in Data Science, I have had a number of people contact me asking for my experience with the course and whether it is worth recommending. Therefore, I thought it best to summarise my decision for starting the course, what I have achieved during my studies, and the outcome in the years following. It was the spring of 2016 and I was coming towards the end of a 6-month internship at one of the largest consulting firms in the City of London. I had taken this role to gain experience and figure out whether becoming an Actuary was the correct route for my career. I quickly found passion in the data analytics of the role as I was being pulled into meetings to discuss numbers I had crunched or was able to hack together a tool to automate previously manual tasks.
Defensive Escort Teams via Multi-Agent Deep Reinforcement Learning
Garg, Arpit, Hasan, Yazied A., Yañez, Adam, Tapia, Lydia
Coordinated defensive escorts can aid a navigating payload by positioning themselves in order to maintain the safety of the payload from obstacles. In this paper, we present a novel, end-to-end solution for coordinating an escort team for protecting high-value payloads. Our solution employs deep reinforcement learning (RL) in order to train a team of escorts to maintain payload safety while navigating alongside the payload. This is done in a distributed fashion, relying only on limited-range positional information of other escorts, the payload, and the obstacles. When compared to a state-of-the-art algorithm for obstacle avoidance, our solution with a single escort increases navigation success by up to 31%. Additionally, escort teams increase success rate by up to 75% over escorts in static formations. We also show that this learned solution is general to several adaptations in the scenario, including a changing number of escorts in the team, changing obstacle density, and changes in payload conformation. Successful navigation in crowded scenarios often requires assuming a nonzero collision probability between the agent and stochastic obstacles [1]. This required assumption of risk is potentially frightening given the value of cargo that modern autonomous agents will be transporting, e.g., human life.
Integrating Behavior Cloning and Reinforcement Learning for Improved Performance in Sparse Reward Environments
Goecks, Vinicius G., Gremillion, Gregory M., Lawhern, Vernon J., Valasek, John, Waytowich, Nicholas R.
This paper investigates how to efficiently transition and update policies, trained initially with demonstrations, using off-policy actor-critic reinforcement learning. It is well-known that techniques based on Learning from Demonstrations, for example behavior cloning, can lead to proficient policies given limited data. However, it is currently unclear how to efficiently update that policy using reinforcement learning, as these approaches are inherently optimizing different objective functions. Previous works have used loss functions which combine behavioral cloning losses with reinforcement learning losses to enable this update; however, the components of these loss functions are often set anecdotally, and their individual contributions are not well understood. In this work we propose the Cycle-of-Learning (CoL) framework that uses an actor-critic architecture with a loss function that combines behavior cloning and 1-step Q-learning losses with an off-policy pre-training step from human demonstrations. This enables transition from behavior cloning to reinforcement learning without performance degradation and improves reinforcement learning in terms of overall performance and training time. Additionally, we carefully study the composition of these combined losses and their impact on overall policy learning. We show that our approach outperforms state-of-the-art techniques for combining behavior cloning and reinforcement learning for both dense and sparse reward scenarios. Our results also suggest that directly including the behavior cloning loss on demonstration data helps to ensure stable learning and ground future policy updates.
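A loss that combines behavior cloning with 1-step Q-learning can be sketched as a weighted sum of two terms. The sketch below is an illustration under stated assumptions, not the CoL loss itself: the function names, the squared-error form of each term, and the weights `w_bc` and `w_q` are hypothetical, scalars stand in for network outputs, and any further terms the framework may use are omitted.

```python
def behavior_cloning_loss(policy_actions, expert_actions):
    """Mean squared error between the policy's actions and the expert's."""
    pairs = list(zip(policy_actions, expert_actions))
    return sum((p - e) ** 2 for p, e in pairs) / len(pairs)


def one_step_q_loss(q_sa, rewards, q_next, dones, gamma=0.99):
    """Mean squared 1-step temporal-difference error.

    q_sa[i]   : critic estimate Q(s_i, a_i)
    q_next[i] : critic estimate at the next state and next action
    dones[i]  : True if the transition ended the episode (no bootstrap)
    """
    total = 0.0
    for q, r, qn, d in zip(q_sa, rewards, q_next, dones):
        target = r if d else r + gamma * qn
        total += (q - target) ** 2
    return total / len(q_sa)


def combined_loss(policy_actions, expert_actions, q_sa, rewards, q_next,
                  dones, gamma=0.99, w_bc=1.0, w_q=1.0):
    """Weighted sum of the two terms; the weights are assumed hyperparameters."""
    bc = behavior_cloning_loss(policy_actions, expert_actions)
    q1 = one_step_q_loss(q_sa, rewards, q_next, dones, gamma)
    return w_bc * bc + w_q * q1


# Example on a two-transition batch of scalar actions and values.
loss = combined_loss(
    policy_actions=[0.0, 1.0], expert_actions=[1.0, 1.0],
    q_sa=[1.0, 2.0], rewards=[1.0, 0.0], q_next=[2.0, 4.0],
    dones=[False, True], gamma=0.5, w_bc=1.0, w_q=0.5,
)
```

Keeping the two terms as separate functions makes it straightforward to vary each weight independently, which is the kind of study of individual contributions the abstract describes.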
CAQL: Continuous Action Q-Learning
Ryu, Moonkyung, Chow, Yinlam, Anderson, Ross, Tjandraatmadja, Christian, Boutilier, Craig
Value-based reinforcement learning (RL) methods like Q-learning have shown success in a variety of domains. One challenge in applying Q-learning to continuous-action RL problems, however, is the continuous action maximization (max-Q) required for optimal Bellman backup. In this work, we develop CAQL, a (class of) algorithm(s) for continuous-action Q-learning that can use several plug-and-play optimizers for the max-Q problem. Leveraging recent optimization results for deep neural networks, we show that max-Q can be solved optimally using mixed-integer programming (MIP). When the Q-function representation has sufficient power, MIP-based optimization gives rise to better policies and is more robust than approximate methods (e.g., gradient ascent, cross-entropy search). We further develop several techniques to accelerate inference in CAQL, which despite their approximate nature, perform well. We compare CAQL with state-of-the-art RL algorithms on benchmark continuous-control problems that have different degrees of action constraints and show that CAQL outperforms policy-based methods in heavily constrained environments, often dramatically. When the action space is finite, value-based algorithms such as Q-learning (Watkins & Dayan, 1992), which implicitly find a policy by learning the optimal value function, are often very efficient because action optimization can be done by exhaustive enumeration. By contrast, in problems with continuous action spaces (e.g., robotics (Peters & Schaal, 2006)), policy-based algorithms, such as policy gradient (PG) (Sutton et al., 2000; Silver et al., 2014) or cross-entropy policy search (CEPS) (Mannor et al., 2003; Kalashnikov et al., 2018), which directly learn a return-maximizing policy, have proven more practical. Recently, methods such as ensemble critic (Fujimoto et al., 2018) and entropy regularization (Haarnoja et al., 2018) have been developed to improve the performance of policy-based RL algorithms.
Policy-based approaches require a reasonable choice of policy parameterization. In some continuous control problems, Gaussian distributions over actions conditioned on some state representation are used. However, in applications such as recommender systems (RSs), where actions often take the form of high-dimensional item-feature vectors, policies cannot typically be modeled by common action distributions.
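The contrast between exhaustive enumeration over a finite action set and the max-Q problem over a continuous, constrained action set can be made concrete with a small sketch. It uses projected gradient ascent with a finite-difference gradient, which is one of the approximate methods the abstract names, not the MIP formulation; the toy Q-function, the function names, and the step settings are assumptions chosen for illustration.

```python
def max_q_enumerate(q, actions):
    """Finite action set: evaluate every action and keep the best one."""
    return max(actions, key=q)


def max_q_gradient_ascent(q, low, high, steps=200, lr=0.1, eps=1e-5):
    """Approximate max-Q over a 1-D box [low, high] by projected gradient ascent.

    The gradient is estimated with a central finite difference, and each
    iterate is clipped back into the box so the action constraint holds.
    This finds a local maximum only; it is not guaranteed to be optimal.
    """
    a = 0.5 * (low + high)  # start from the middle of the feasible interval
    for _ in range(steps):
        grad = (q(a + eps) - q(a - eps)) / (2.0 * eps)
        a = min(max(a + lr * grad, low), high)
    return a


def toy_q(a):
    """Hypothetical concave Q-function of the action, peaking at a = 0.3."""
    return -(a - 0.3) ** 2


best_discrete = max_q_enumerate(toy_q, [-1.0, 0.0, 0.5, 1.0])
best_loose = max_q_gradient_ascent(toy_q, low=-1.0, high=1.0)
best_tight = max_q_gradient_ascent(toy_q, low=-1.0, high=0.1)
```

With the loose bounds the iterate settles near the unconstrained maximizer; with the tight bounds it stops at the upper limit of the feasible interval. For a non-concave Q-function such as a neural network, gradient ascent may stop at a local maximum, which is the gap that an exact max-Q solver is meant to close.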
Imagined Value Gradients: Model-Based Policy Optimization with Transferable Latent Dynamics Models
Byravan, Arunkumar, Springenberg, Jost Tobias, Abdolmaleki, Abbas, Hafner, Roland, Neunert, Michael, Lampe, Thomas, Siegel, Noah, Heess, Nicolas, Riedmiller, Martin
Humans are masters at quickly learning many complex tasks, relying on an approximate understanding of the dynamics of their environments. In much the same way, we would like our learning agents to quickly adapt to new tasks. In this paper, we explore how model-based Reinforcement Learning (RL) can facilitate transfer to new tasks. We develop an algorithm that learns an action-conditional, predictive model of expected future observations, rewards and values from which a policy can be derived by following the gradient of the estimated value along imagined trajectories. We show how robust policy optimization can be achieved in robot manipulation tasks even with approximate models that are learned directly from vision and proprioception. We evaluate the efficacy of our approach in a transfer learning scenario, re-using previously learned models on tasks with different reward structures and visual distractors, and show a significant improvement in learning speed compared to strong off-policy baselines. Videos with results can be found at https://sites.google.com/view/ivg-corl19