Imitation Learning by Reinforcement Learning

Ciosek, Kamil

arXiv.org Machine Learning 

Typically, Reinforcement Learning (RL) assumes access to a pre-specified reward and then learns a policy maximizing the expected average of this reward along a trajectory. However, specifying rewards is difficult for many practical tasks (Atkeson & Schaal, 1997; Zhang et al., 2018; Ibarz et al., 2018). In such cases, it is convenient to instead perform Imitation Learning (IL), which learns a policy from expert demonstrations. There are two major categories of Imitation Learning algorithms: Behavioral Cloning and Inverse Reinforcement Learning. Behavioral Cloning learns the policy by supervised learning on expert data, but it is not robust to training errors and fails in settings where expert data is limited (Ross & Bagnell, 2010). Inverse Reinforcement Learning (IRL) achieves improved performance on limited data by constructing reward signals and calling an RL oracle to maximize these rewards (Ng et al., 2000).
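To make the supervised-learning view of Behavioral Cloning concrete, the following is a minimal sketch for a tabular setting. It is an illustration under stated assumptions, not a method taken from the cited works: states and actions are assumed to be discrete and hashable, the policy is fit by a majority vote over expert actions in each state, and the names `behavioral_cloning`, `act`, `expert_data`, and `default_action` are hypothetical.

```python
from collections import Counter, defaultdict


def behavioral_cloning(demonstrations):
    """Fit a tabular policy from (state, action) pairs by majority vote.

    Assumption: states and actions are discrete and hashable.
    """
    counts = defaultdict(Counter)
    for state, action in demonstrations:
        counts[state][action] += 1
    # For each state seen in the expert data, keep the most frequent action.
    return {state: c.most_common(1)[0][0] for state, c in counts.items()}


def act(policy, state, default_action):
    """Return the cloned action, or a fallback for states the expert never visited."""
    # States absent from the demonstrations carry no supervised label, so the
    # cloned policy has nothing to imitate there and must fall back.
    return policy.get(state, default_action)


# Hypothetical toy demonstrations: (state, action) pairs from an expert.
expert_data = [
    (0, "right"), (1, "right"), (0, "right"),
    (2, "up"), (1, "left"), (1, "right"),
]
policy = behavioral_cloning(expert_data)
```

In this sketch, any state missing from `expert_data` is handled only by the arbitrary `default_action`, which gives a rough picture of why a purely supervised approach can struggle when expert data is limited.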