Reinforcement Learning
On the Convergence of Consensus Algorithms with Markovian Noise and Gradient Bias
This paper presents a finite-time convergence analysis for a decentralized stochastic approximation (SA) scheme. The scheme generalizes several algorithms for decentralized machine learning and multi-agent reinforcement learning. Our proof technique involves separating the iterates into their respective consensual parts and consensus error. The consensus error is bounded in terms of the stationarity of the consensual part, while the updates of the consensual part can be analyzed as a perturbed SA scheme. Under the Markovian noise and time-varying communication graph assumptions, the decentralized SA scheme has an expected convergence rate of ${\cal O}(\log T/ \sqrt{T} )$, where $T$ is the iteration number, in terms of the squared norm of the gradient for nonlinear SA with a smooth but non-convex cost function. This rate is comparable to the best-known performance of SA in a centralized setting with a non-convex potential function.
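The split into a consensual part (the network average) and a consensus error (the spread around that average) can be illustrated with a toy decentralized SA loop. This is only a sketch under simplifying assumptions that are not the paper's setting: a fixed ring graph rather than a time-varying one, i.i.d. Gaussian noise rather than Markovian noise, and convex quadratic local costs $f_i(x) = \frac{1}{2}(x - b_i)^2$. The function name `decentralized_sa` and all of its parameters are hypothetical.

```python
import random


def decentralized_sa(targets, steps=2000, step_scale=0.5, noise_std=0.1, seed=0):
    """Toy decentralized SA on a ring (illustrative assumptions, see lead-in).

    Agent i holds a scalar iterate x[i] and a local cost 0.5 * (x - targets[i])**2.
    Each step mixes neighbouring iterates, then takes a noisy local gradient step.
    """
    rng = random.Random(seed)
    n = len(targets)
    x = [0.0] * n
    for t in range(steps):
        gamma = step_scale / (t + 1) ** 0.5  # diminishing step size ~ 1/sqrt(t)
        # Consensus step: average self, left and right neighbour (doubly stochastic weights).
        mixed = [(x[i - 1] + x[i] + x[(i + 1) % n]) / 3.0 for i in range(n)]
        # Local SA step: gradient of the local cost at x[i], plus i.i.d. noise.
        x = [
            mixed[i] - gamma * ((x[i] - targets[i]) + rng.gauss(0.0, noise_std))
            for i in range(n)
        ]
    consensual_part = sum(x) / n  # network average of the iterates
    consensus_error = sum((xi - consensual_part) ** 2 for xi in x)
    return x, consensual_part, consensus_error


iterates, average, error = decentralized_sa([1.0, 2.0, 3.0, 4.0, 5.0])
print(average, error)
```

Because the mixing weights are doubly stochastic, the average evolves like a single perturbed SA iterate on the mean cost, while the consensus error shrinks with the step size. In this toy problem the average should settle near the mean of the targets.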
LBGP: Learning Based Goal Planning for Autonomous Following in Front
Nikdel, Payam, Vaughan, Richard, Chen, Mo
This paper investigates a hybrid solution that combines deep reinforcement learning (RL) and classical trajectory planning for the following-in-front application. Here, an autonomous robot aims to stay ahead of a person as the person freely walks around. Following in front is a challenging problem because the user's intended trajectory is unknown and must be estimated, explicitly or implicitly, by the robot. In addition, the robot needs to find a feasible way to safely navigate ahead of the human's trajectory. Our deep RL module implicitly estimates the human's trajectory and produces short-term navigational goals to guide the robot. These goals are used by a trajectory planner to smoothly navigate the robot to the short-term goals, and eventually in front of the user. We employ curriculum learning in the deep RL module to efficiently achieve a high return. Our system outperforms the state-of-the-art in following ahead and is more reliable than end-to-end alternatives in both simulation and real-world experiments. In contrast to a pure deep RL approach, we demonstrate zero-shot transfer of the trained policy from simulation to the real world.
RealAnt: An Open-Source Low-Cost Quadruped for Research in Real-World Reinforcement Learning
Boney, Rinu, Sainio, Jussi, Kaivola, Mikko, Solin, Arno, Kannala, Juho
Current robot platforms available for research are either very expensive or unable to handle the abuse of exploratory controls in reinforcement learning. We develop RealAnt, a minimal low-cost physical version of the popular 'Ant' benchmark used in reinforcement learning. RealAnt costs only 350 € ($410) in materials and can be assembled in less than an hour. We validate the platform with reinforcement learning experiments and provide baseline results on a set of benchmark tasks. We demonstrate that the TD3 algorithm can learn to walk the RealAnt from less than 45 minutes of experience. We also provide simulator versions of the robot (with the same dimensions, state-action spaces, and delayed noisy observations) in the MuJoCo and PyBullet simulators.
Learning to Utilize Shaping Rewards: A New Approach of Reward Shaping
Hu, Yujing, Wang, Weixun, Jia, Hangtian, Wang, Yixiang, Chen, Yingfeng, Hao, Jianye, Wu, Feng, Fan, Changjie
Reward shaping is an effective technique for incorporating domain knowledge into reinforcement learning (RL). Existing approaches such as potential-based reward shaping normally make full use of a given shaping reward function. However, since the transformation of human knowledge into numeric reward values is often imperfect due to reasons such as human cognitive bias, completely utilizing the shaping reward function may fail to improve the performance of RL algorithms. In this paper, we consider the problem of adaptively utilizing a given shaping reward function. We formulate the utilization of shaping rewards as a bi-level optimization problem, where the lower level is to optimize policy using the shaping rewards and the upper level is to optimize a parameterized shaping weight function for true reward maximization. We formally derive the gradient of the expected true reward with respect to the shaping weight function parameters and accordingly propose three learning algorithms based on different assumptions. Experiments in sparse-reward cartpole and MuJoCo environments show that our algorithms can fully exploit beneficial shaping rewards, and meanwhile ignore unbeneficial shaping rewards or even transform them into beneficial ones.
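As a rough illustration of what "adaptively utilizing" a shaping reward means, the sketch below combines a true reward with a weighted potential-based shaping term. It is a minimal sketch, not the paper's algorithm: the weight is a plain scalar passed in by hand, whereas the paper learns a parameterized weight function through the upper-level optimization, which is not shown. The function names and arguments are hypothetical.

```python
def potential_based_shaping(phi_s, phi_next, gamma):
    """Standard potential-based shaping term: gamma * Phi(s') - Phi(s)."""
    return gamma * phi_next - phi_s


def weighted_shaped_reward(true_reward, shaping_reward, weight):
    """Reward seen by the lower-level policy: true reward plus weighted shaping.

    weight = 1 uses the shaping reward fully, weight = 0 ignores it,
    and a negative weight flips its sign.
    """
    return true_reward + weight * shaping_reward


# Hypothetical transition: Phi(s) = 2.0, Phi(s') = 3.0, discount 0.9, true reward 1.0.
f = potential_based_shaping(phi_s=2.0, phi_next=3.0, gamma=0.9)
print(weighted_shaped_reward(1.0, f, weight=0.5))
print(weighted_shaped_reward(1.0, f, weight=0.0))
```

Setting the weight per state-action pair, rather than globally, is what lets a learned weight function keep helpful shaping signals and suppress or invert harmful ones.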
Deep Reactive Planning in Dynamic Environments
Ota, Kei, Jha, Devesh K., Onishi, Tadashi, Kanezaki, Asako, Yoshiyasu, Yusuke, Sasaki, Yoko, Mariyama, Toshisada, Nikovski, Daniel
The main novelty of the proposed approach is that it allows a robot to learn an end-to-end policy which can adapt to changes in the environment during execution. While goal conditioning of policies has been studied in the RL literature, such approaches are not easily extended to cases where the robot's goal can change during execution. This is something that humans are naturally able to do. However, it is difficult for robots to learn such reflexes (i.e., to naturally respond to dynamic environments), especially when the goal location is not explicitly provided to the robot, and instead needs to be perceived through a vision sensor. In the current work, we present a method that can achieve such behavior by combining traditional kinematic planning, deep learning, and deep reinforcement learning in a synergistic fashion to generalize to arbitrary environments. We demonstrate the proposed approach for several reaching and pick-and-place tasks in simulation, as well as on a real system of a 6-DoF industrial manipulator. A video describing our work can be found at https://youtu.be/hE-Ew59GRPQ.
How to Make Sense of the Reinforcement Learning Agents? - KDnuggets
Based on simply watching how an agent acts in the environment it is hard to tell anything about why it behaves this way and how it works internally. That's why it is crucial to establish metrics that tell WHY the agent performs in a certain way. This is challenging especially when the agent doesn't behave the way we would like it to behave, … which is like always. Every AI practitioner knows that whatever we work on, most of the time it won't simply work out of the box (they wouldn't pay us so much for it otherwise). In this blog post, you'll learn what to keep track of to inspect/debug your agent learning trajectory. I'll assume you are already familiar with the Reinforcement Learning (RL) agent-environment setting (see Figure 1) and you've heard about at least some of the most common RL algorithms and environments. Nevertheless, don't worry if you are just beginning your journey with RL.
Google, OpenAI & DeepMind: Shared Task Behaviour Priors Can Boost RL and Generalization
Researchers in recent years have deployed reinforcement learning (RL) agents to solve increasingly challenging problems. As the trend continues, so does the development of new methods that enable the injection of "priors" (prior knowledge) into agents to help them better understand the structure of the world and come up with more effective solution strategies. In a new paper, researchers from Google, OpenAI, and DeepMind introduce "behaviour priors," a framework designed to capture common movement and interaction patterns that are shared across a set of related tasks or contexts. The researchers discuss how such behaviour patterns can be captured using probabilistic trajectory models and how they can be integrated effectively into RL schemes, such as for facilitating multi-task and transfer learning. Their method for learning behaviour priors can lead to significant speedups on complex tasks, the researchers say.
Deploying reinforcement learning in production using Ray and Amazon SageMaker
Reinforcement learning (RL) is used to automate decision-making in a variety of domains, including games, autoscaling, finance, robotics, recommendations, and supply chain. Launched at AWS re:Invent 2018, Amazon SageMaker RL helps you quickly build, train, and deploy policies learned by RL. Ray is an open-source distributed execution framework that makes it easy to scale your Python applications. Amazon SageMaker RL uses the RLlib library that builds on the Ray framework to train RL policies. This post walks you through the tools available in Ray and Amazon SageMaker RL that help you address challenges such as scale, security, iterative development, and operational cost when you use RL in production.
Adaptive Stress Testing of Trajectory Predictions in Flight Management Systems
Moss, Robert J., Lee, Ritchie, Visser, Nicholas, Hochwarth, Joachim, Lopez, James G., Kochenderfer, Mykel J.
To find failure events and their likelihoods in flight-critical systems, we investigate the use of an advanced black-box stress testing approach called adaptive stress testing. We analyze a trajectory predictor from a developmental commercial flight management system which takes as input a collection of lateral waypoints and en-route environmental conditions. Our aim is to search for failure events relating to inconsistencies in the predicted lateral trajectories. The intention of this work is to find likely failures and report them back to the developers so they can address and potentially resolve shortcomings of the system before deployment. To improve search performance, this work extends the adaptive stress testing formulation to be applied more generally to sequential decision-making problems with episodic reward by collecting the state transitions during the search and evaluating at the end of the simulated rollout. We use a modified Monte Carlo tree search algorithm with progressive widening as our adversarial reinforcement learner. The performance is compared to direct Monte Carlo simulations and to the cross-entropy method as an alternative importance sampling baseline. The goal is to find potential problems otherwise not found by traditional requirements-based testing. Results indicate that our adaptive stress testing approach finds more failures and finds failures with higher likelihood relative to the baseline approaches.
Harnessing Distribution Ratio Estimators for Learning Agents with Quality and Diversity
Gangwani, Tanmay, Peng, Jian, Zhou, Yuan
The goal in Reinforcement Learning (RL) is to learn agents that maximize long-term environmental rewards. Deep RL, which uses deep neural networks as function approximators for the policy and value-functions, has achieved outstanding results on a wide variety of sequential decision making problems, with the barometer of success usually being the total returns accumulated by the final policy. Due to the intrinsic nature of direct reward maximization, seldom is the focus on how the behavioral characteristics of the trained agent compare with the other possible behaviors in the solution space. For instance, consider the robotic manipulator arm in Figure 1a and the peg-insertion task. Though the task description is simple, for a sufficiently flexible arm, there are numerous ways (positions of the joints and the end-effector) to insert the peg in the hole (Figure 1b).