Reviews: Reward learning from human preferences and demonstrations in Atari
–Neural Information Processing Systems
As the title implies, this paper examines imitation learning that combines human demonstrations with human preferences. The main algorithm builds on DQfD to learn Q-values from human demonstrations and subsequently fine-tunes the policy using preference elicitation. More specifically, preferences are compiled into a surrogate reward function, which is then used to further optimize the policy. The resulting algorithm is validated on nine Atari environments, and the results show that combining demonstrations with preferences outperforms using either source of feedback alone. Overall, the paper is clearly written, tackles a well-scoped problem, and presents compelling results.
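To make the reviewed mechanism concrete: a minimal sketch (not the authors' code) of fitting a surrogate reward from pairwise preferences via the Bradley-Terry model standard in preference-based RL. The linear reward model, clip length, learning rate, and toy features are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def segment_return(theta, clip):
    # Sum of per-step rewards r(s) = theta @ s over a clip of state features.
    return float((clip @ theta).sum())

def pref_prob(theta, clip_a, clip_b):
    # Bradley-Terry model: P(a preferred over b) = sigmoid(R_a - R_b).
    return 1.0 / (1.0 + np.exp(-(segment_return(theta, clip_a)
                                 - segment_return(theta, clip_b))))

def update(theta, clip_a, clip_b, pref, lr=0.1):
    # One gradient step on the cross-entropy preference loss
    # L = -[pref * log p + (1 - pref) * log(1 - p)], whose gradient for a
    # linear reward model is (p - pref) * (sum of a's features - sum of b's).
    p = pref_prob(theta, clip_a, clip_b)
    g = (p - pref) * (clip_a.sum(axis=0) - clip_b.sum(axis=0))
    return theta - lr * g

# Toy setting: the (hidden) true reward depends only on feature 0.
true_theta = np.array([1.0, 0.0])
theta = np.zeros(2)
clips = [rng.normal(size=(10, 2)) for _ in range(200)]
for _ in range(3):  # a few passes over synthetic preference pairs
    for a, b in zip(clips[::2], clips[1::2]):
        pref = 1.0 if segment_return(true_theta, a) > segment_return(true_theta, b) else 0.0
        theta = update(theta, a, b, pref)

print(theta)  # learned reward should weight feature 0 positively
```

In the paper this surrogate reward would then replace the environment reward when further optimizing the DQfD-pretrained policy; the sketch only covers the reward-fitting step.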
Oct-7-2024, 17:51:52 GMT