Kidziński, Łukasz, Ong, Carmichael, Mohanty, Sharada Prasanna, Hicks, Jennifer, Carroll, Sean F., Zhou, Bo, Zeng, Hongsheng, Wang, Fan, Lian, Rongzhong, Tian, Hao, Jaśkowski, Wojciech, Andersen, Garrett, Lykkebø, Odd Rune, Toklu, Nihat Engin, Shyam, Pranav, Srivastava, Rupesh Kumar, Kolesnikov, Sergey, Hrinchuk, Oleksii, Pechenko, Anton, Ljungström, Mattias, Wang, Zhen, Hu, Xu, Hu, Zehong, Qiu, Minghui, Huang, Jun, Shpilman, Aleksei, Sosin, Ivan, Svidchenko, Oleg, Malysheva, Aleksandra, Kudenko, Daniel, Rane, Lance, Bhatt, Aditya, Wang, Zhengfei, Qi, Penghui, Yu, Zeyang, Peng, Peng, Yuan, Quan, Li, Wenxin, Tian, Yunsheng, Yang, Ruihan, Ma, Pingchuan, Khadka, Shauharda, Majumdar, Somdeb, Dwiel, Zach, Liu, Yinyin, Tumer, Evren, Watson, Jeremy, Salathé, Marcel, Levine, Sergey, Delp, Scott
In the NeurIPS 2018 Artificial Intelligence for Prosthetics challenge, participants were tasked with building a controller for a musculoskeletal model with a goal of matching a given time-varying velocity vector. Top participants were invited to describe their algorithms. In this work, we describe the challenge and present thirteen solutions that used deep reinforcement learning approaches. Many solutions use similar relaxations and heuristics, such as reward shaping, frame skipping, discretization of the action space, symmetry, and policy blending. However, each team implemented different modifications of the known algorithms by, for example, dividing the task into subtasks, learning low-level control, or by incorporating expert knowledge and using imitation learning.
In this paper, we describe NeurIPS 2019 Learning to Move - Walk Around challenge physics-based environment and present our solution to this competition which scored 1303.727 mean reward points and took 3rd place. Our method combines recent advances from both continuous- and discrete-action space reinforcement learning, such as Soft Actor-Critic and Recurrent Experience Replay in Distributed Reinforcement Learning. We trained our agent in two stages: to move somewhere at the first stage and to follow the target velocity field at the second stage. We also introduce novel Q-function split technique, which we believe facilitates the task of training an agent, allows critic pretraining and reusing it for solving harder problems, and mitigate reward shaping design efforts.
We optimize a six degrees of freedom hovering policy using reinforcement meta-learning. The policy maps flash LIDAR measurements directly to on/off spacecraft body-frame thrust commands, allowing hovering at a fixed position and attitude in the asteroid body-fixed reference frame. Importantly, the policy does not require position and velocity estimates, and can operate in environments with unknown dynamics, and without an asteroid shape model or navigation aids. Indeed, during optimization the agent is confronted with a new randomly generated asteroid for each episode, insuring that it does not learn an asteroid's shape, texture, or environmental dynamics. This allows the deployed policy to generalize well to novel asteroid characteristics, which we demonstrate in our experiments. The hovering controller has the potential to simplify mission planning by allowing asteroid body-fixed hovering immediately upon the spacecraft's arrival to an asteroid. This in turn simplifies shape model generation and allows resource mapping via remote sensing immediately upon arrival at the target asteroid.
We present a ping-pong-playing robot that learns to improve its swings with human advice. Our method learns a reward function over the joint space of task and policy parameters T×P, so the robot can explore policy space more intelligently in a way that trades off exploration vs. exploitation to maximize the total cumulative reward over time. Multimodal stochastic polices can also easily be learned with this approach when the reward function is multimodal in the policy parameters. We extend the recently-developed Gaussian Process Bandit Optimization framework to include exploration-bias advice from human domain experts, using a novel algorithm called Exploration Bias with Directional Advice (EBDA).