On Generalization and Distributional Update for Mimicking Observations with Adequate Exploration

Zhou, Yirui, Liu, Xiaowei, Zhang, Xiaofeng, Zhang, Yangchun

arXiv.org Machine Learning 

Imitation learning (IL) (Pomerleau, 1991; Ng et al., 2000; Syed and Schapire, 2007; Ho and Ermon, 2016), a realm distinct from standard reinforcement learning (RL) (Puterman, 2014; Sutton and Barto, 2018), does not depend on rewards provided by the environment. This characteristic makes IL particularly well suited to numerous real-world applications (Bhattacharyya et al., 2018; Shi et al., 2019; Jabri, 2021). The general IL paradigm leverages guidance from expert demonstrations, which contain both states and actions, to mimic an expert policy (Abbeel and Ng, 2004; Ho and Ermon, 2016; Kostrikov et al., 2020). According to the policy training strategy, IL is divided into two main schemes: on-policy and off-policy training. The on-policy scheme (Ho and Ermon, 2016; Chen et al., 2020) is noted for its stability but requires a significant volume of samples.
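To make the reward-free paradigm concrete, the following is a minimal behavioral-cloning sketch, not the method of this paper: the learner fits a policy to expert (state, action) demonstrations by pure supervised regression, with no environment reward involved. The linear expert `W_expert` and the data shapes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear expert policy: a = W_expert @ s (for illustration only).
W_expert = np.array([[1.0, -0.5],
                     [0.3,  2.0]])

# Expert demonstrations: states and the actions the expert took in them.
# Note that no reward signal appears anywhere below.
states = rng.normal(size=(500, 2))
actions = states @ W_expert.T

# Behavioral cloning: least-squares fit of a linear policy to the demos,
# i.e., regress demonstrated actions on demonstrated states.
W_learned, *_ = np.linalg.lstsq(states, actions, rcond=None)
W_learned = W_learned.T

# The cloned policy reproduces the expert on held-out states.
test_states = rng.normal(size=(10, 2))
max_err = np.max(np.abs(test_states @ W_learned.T - test_states @ W_expert.T))
print(max_err < 1e-8)
```

This supervised view is the simplest instance of learning from state-action demonstrations; adversarial methods such as GAIL (Ho and Ermon, 2016) instead match the learner's state-action distribution to the expert's, which is the on-policy scheme discussed above.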