Adversarial Skill Chaining for Long-Horizon Robot Manipulation via Terminal State Regularization

Youngwoon Lee, Joseph J. Lim, Anima Anandkumar, Yuke Zhu

arXiv.org Artificial Intelligence 

Deep reinforcement learning (RL) presents a promising framework for learning impressive robot behaviors [1-4]. Yet, learning a complex long-horizon task with a single control policy is still challenging, mainly due to the high computational cost and exploration burden of RL [5]. A more practical approach is to decompose the whole task into smaller subtasks, learn a policy for each subtask, and execute the subtasks sequentially to accomplish the entire task [6-9]. However, naively executing one policy after another fails when a subtask policy encounters a starting state never seen during its training [6, 7, 9]. In other words, a terminal state of one subtask may fall outside the set of starting states that the next subtask policy can handle, causing the next subtask to fail, as illustrated in Figure 1a. Especially in robot manipulation, complex interactions between a high-DoF robot and multiple objects can lead to a wide range of robot and object configurations that are infeasible for a single policy to cover [10]. Therefore, chaining skills with policies of limited capability is not trivial and requires adapting the policies so that they are suitable for sequential execution. To resolve the mismatch between the terminal state distribution of one subtask policy (i.e., its termination set) and the set of starting states the next subtask policy can handle, this paper proposes an adversarial skill chaining framework with terminal state regularization, which encourages each subtask policy to terminate in states from which the following policy can succeed.
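
To make the failure mode concrete, below is a minimal, hypothetical Python sketch (not the paper's implementation): toy subtask policies are executed back-to-back, and chaining breaks as soon as one policy terminates outside the set of start states the next policy was trained on. The class `SubtaskPolicy`, the `initiation_set` predicate, the toy dynamics, and all parameter values are assumptions introduced purely for illustration.

```python
import numpy as np


class SubtaskPolicy:
    """Hypothetical stand-in for a pretrained subtask policy."""

    def __init__(self, name, start_low, start_high, target):
        self.name = name
        # Box of start states the policy saw during training (its initiation set).
        self.start_low, self.start_high = start_low, start_high
        # Toy goal the policy drives the state toward.
        self.target = target

    def initiation_set(self, state):
        """True if `state` resembles a start state seen during training."""
        return bool(np.all((state >= self.start_low) & (state <= self.start_high)))

    def act(self, state):
        """Toy controller: step the state toward the subtask target."""
        return np.clip(self.target - state, -0.1, 0.1)


def chain_policies(policies, state, steps_per_subtask=50):
    """Naively run subtask policies back-to-back, as in Figure 1a."""
    for policy in policies:
        if not policy.initiation_set(state):
            # The terminal state of the previous subtask fell outside the set
            # of start states this policy can handle, so chaining breaks down.
            print(f"Chaining failed: '{policy.name}' cannot start from {state}")
            return False
        for _ in range(steps_per_subtask):
            state = state + policy.act(state)  # toy dynamics: s' = s + a
    print(f"All subtasks finished; final state {state}")
    return True


if __name__ == "__main__":
    policies = [
        # First subtask starts anywhere in [0, 1]^2 and terminates near (0.7, 0.7).
        SubtaskPolicy("reach", np.zeros(2), np.ones(2), np.full(2, 0.7)),
        # Second subtask was only trained to start inside [0.9, 1.1]^2.
        SubtaskPolicy("insert", np.full(2, 0.9), np.full(2, 1.1), np.ones(2)),
    ]
    chain_policies(policies, state=np.array([0.2, 0.3]))
```

In this toy example the first policy terminates near (0.7, 0.7), outside the second policy's start region, which is exactly the terminal/initiation-set mismatch that terminal state regularization is intended to close.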