imitation learner
Imitation Learning by Reinforcement Learning
Typically, Reinforcement Learning (RL) assumes access to a pre-specified reward and then learns a policy maximizing the expected average of this reward along a trajectory. However, specifying rewards is difficult for many practical tasks (Atkeson & Schaal, 1997; Zhang et al., 2018; Ibarz et al., 2018). In such cases, it is convenient to instead perform Imitation Learning (IL), learning a policy from expert demonstrations. There are two major categories of Imitation Learning algorithms: Behavioral Cloning and Inverse Reinforcement Learning. Behavioral Cloning learns the policy by supervised learning on expert data, but is not robust to training errors, failing in settings where expert data is limited (Ross & Bagnell, 2010). Inverse Reinforcement Learning (IRL) achieves improved performance on limited data by constructing reward signals and calling an RL oracle to maximize these rewards (Ng et al., 2000).
Fully General Online Imitation Learning
Cohen, Michael K., Hutter, Marcus, Nanda, Neel
In imitation learning, imitators and demonstrators are policies for picking actions given past interactions with the environment. If we run an imitator, we probably want events to unfold similarly to the way they would have if the demonstrator had been acting the whole time. No existing work provides formal guidance in how this might be accomplished, instead restricting focus to environments that restart, making learning unusually easy, and conveniently limiting the significance of any mistake. We address a fully general setting, in which the (stochastic) environment and demonstrator never reset, not even for training purposes. Our new conservative Bayesian imitation learner underestimates the probabilities of each available action, and queries for more data with the remaining probability. Our main result: if an event would have been unlikely had the demonstrator acted the whole time, that event's likelihood can be bounded above when running the (initially totally ignorant) imitator instead. Meanwhile, queries to the demonstrator rapidly diminish in frequency.