Reviews: Reward learning from human preferences and demonstrations in Atari
–Neural Information Processing Systems
As the title implies, this paper examines imitation learning that combines human demonstrations with human preferences. The main algorithm builds on DQfD to learn Q-values from human demonstrations and subsequently fine-tunes the policy using preference elicitation. More specifically, preferences are compiled into a surrogate reward function, which is then used to further optimize the policy. The resulting algorithm is validated on nine Atari environments, and the results show that combining demonstrations with preferences outperforms using either source of feedback alone. Overall, the paper is clearly written, tackles a well-scoped problem, and presents compelling results.
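To make the reviewed mechanism concrete: a minimal sketch (not the authors' code) of fitting a surrogate reward from pairwise preferences via the Bradley-Terry model standard in preference-based RL. The linear reward model, clip length, learning rate, and toy features are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def segment_return(theta, clip):
    # Sum of per-step rewards r(s) = theta @ s over a clip of state features.
    return float((clip @ theta).sum())

def pref_prob(theta, clip_a, clip_b):
    # Bradley-Terry model: P(a preferred over b) = sigmoid(R_a - R_b).
    return 1.0 / (1.0 + np.exp(-(segment_return(theta, clip_a)
                                 - segment_return(theta, clip_b))))

def update(theta, clip_a, clip_b, pref, lr=0.1):
    # One gradient step on the cross-entropy preference loss
    # L = -[pref * log p + (1 - pref) * log(1 - p)], whose gradient for a
    # linear reward model is (p - pref) * (sum of a's features - sum of b's).
    p = pref_prob(theta, clip_a, clip_b)
    g = (p - pref) * (clip_a.sum(axis=0) - clip_b.sum(axis=0))
    return theta - lr * g

# Toy setting: the (hidden) true reward depends only on feature 0.
true_theta = np.array([1.0, 0.0])
theta = np.zeros(2)
clips = [rng.normal(size=(10, 2)) for _ in range(200)]
for _ in range(3):  # a few passes over synthetic preference pairs
    for a, b in zip(clips[::2], clips[1::2]):
        pref = 1.0 if segment_return(true_theta, a) > segment_return(true_theta, b) else 0.0
        theta = update(theta, a, b, pref)

print(theta)  # learned reward should weight feature 0 positively
```

In the paper this surrogate reward would then replace the environment reward when further optimizing the DQfD-pretrained policy; the sketch only covers the reward-fitting step.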
Oct-7-2024, 17:51:52 GMT