On Time-Indexing as Inductive Bias in Deep RL for Sequential Manipulation Tasks

Qureshi, M. Nomaan, Eisner, Ben, Held, David

arXiv.org Artificial Intelligence 

In standard policy learning, a single neural-network-based policy is tasked with learning both of these skills (and learning to switch between them) without access to any structure that explicitly encodes the multi-modal nature of the task space. Ideally, policies would emergently learn to decompose tasks at different levels of abstraction and factor the task learning into distinct skills. One common approach is to jointly learn a set of subskills together with a selection function that chooses which subskill to execute at the current time step [5]. This poses a fundamental bootstrapping issue: as the skills change and improve, the selection function must change and improve as well, which can lead to unstable training. An important observation about many optimal policies for manipulation tasks is that skills tend to be executed in sequence, without backtracking. Time itself can therefore serve as a useful signal for skill selection. For instance, while executing a stacking task, it is reasonable to assume that the robot will perform the 'reach' skill at the start of the task and the 'stack' skill towards the end. Our intuition is that selecting the skill according to the current time step can serve as a good strategy for skill selection.
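To make the time-indexing intuition concrete, below is a minimal sketch (not the authors' implementation) of a time-indexed policy: the episode horizon is split into a fixed number of phases, and the phase containing the current time step determines which subskill head produces the action. All names and hyperparameters here (`TimeIndexedPolicy`, `obs_dim`, `act_dim`, `num_skills`, `horizon`) are illustrative assumptions, not taken from the paper.

```python
# Sketch of time-indexed skill selection: the timestep, rather than a learned
# selection function, routes the observation to one of K subskill networks.
import torch
import torch.nn as nn


class TimeIndexedPolicy(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, num_skills: int, horizon: int):
        super().__init__()
        self.horizon = horizon
        self.num_skills = num_skills
        # One small MLP per subskill; time alone decides which one is active.
        self.skills = nn.ModuleList([
            nn.Sequential(
                nn.Linear(obs_dim, 64), nn.ReLU(),
                nn.Linear(64, act_dim), nn.Tanh(),
            )
            for _ in range(num_skills)
        ])

    def forward(self, obs: torch.Tensor, t: int) -> torch.Tensor:
        # Map the current timestep to a skill index: early steps use the first
        # skill (e.g. 'reach'), later steps use the last skill (e.g. 'stack').
        frac = min(t, self.horizon - 1) / self.horizon
        skill_idx = int(frac * self.num_skills)
        return self.skills[skill_idx](obs)


if __name__ == "__main__":
    policy = TimeIndexedPolicy(obs_dim=10, act_dim=4, num_skills=2, horizon=100)
    obs = torch.randn(1, 10)
    action_early = policy(obs, t=5)   # routed to the first ('reach'-like) skill
    action_late = policy(obs, t=90)   # routed to the second ('stack'-like) skill
```

Because the skill schedule here is a fixed function of time, the subskill networks can be trained without the bootstrapping instability of jointly learning a selection function; this fixed partition is only one simple way to realize the time-indexing idea described above.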