Inducing Structure in Reward Learning by Learning Features

Bobu, Andreea, Wiggert, Marius, Tomlin, Claire, Dragan, Anca D.

arXiv.org Artificial Intelligence 

Whether it's semi-autonomous driving (Sadigh et al. 2016), recommender systems (Ziebart et al. 2008), or household robots working in close proximity with people (Jain et al. 2015), reward learning can greatly benefit autonomous agents to generate behaviors that adapt to new situations or human preferences. Under this framework, the robot uses the person's input to learn a reward function that describes how they prefer the task to be performed. For instance, in the scenario in Fig. 1, the human wants the robot to keep the cup away from the laptop to prevent spilling liquid over it; she may communicate this preference to the robot by providing a demonstration of the task or even by directly intervening during the robot's task execution to correct it.

In doing so, however, these approaches sacrifice the sample efficiency and generalizability that a well-specified feature set offers. While using an expressive function approximator to extract features and learn their reward combination at once seems advantageous, many such functions can induce policies that explain the demonstrations. Hence, to disambiguate between all these candidate functions, the robot requires a very large amount of (laborious to collect) data, and this data needs to be diverse enough to identify the true reward. For example, the human in the household robot setting in Figure 1 might want to demonstrate keeping the cup away from the laptop, but from a single demonstration the robot could find many other explanations for the person's behavior: perhaps they always happened to keep the cup upright or they really