Exploiting Unlabeled Data for Feedback Efficient Human Preference based Reinforcement Learning

Mudit Verma, Siddhant Bhambri, Subbarao Kambhampati

arXiv.org Artificial Intelligence 

However, much of this success has also been attributed to well-specified reward functions, which ground the agent's behavior, and the subsequent task, in the expected manner. As prior works have argued, the specification of low-level reward functions even for seemingly easy tasks can be quite difficult and may still lead to inexplicable and unexpected results (Verma et al. 2019, 2021; Gopalakrishnan, Verma, and Kambhampati 2021a,b), potentially affecting human-AI trust (Zahedi et al. 2021, 2022). For example, works like (Krakovna et al. 2020; Vamplew et al. 2018) have raised the issues of reward hacking and reward exploitation, where RL agents discover behaviors that seem to be "cheating" or incorrect and yet maximize the expected cumulative reward. This has also drawn attention from the explainable AI community, which attempts to analyze whether the agent is actually behaving in the intended manner (Verma, Kharkwal, and Kambhampati 2022; Sreedharan et al. 2020; Kambhampati et al.).

Our first observation reinforces the fact that most of the explored trajectories do not actually participate in the reward learning process; in fact, their best chance to affect the reward function is once they get sampled and queried to the human in the loop. We posit that this untapped data source can greatly improve reward recovery and reduce the feedback sample complexity. Our second observation notes that the reward function being learnt may not conform to the structure of the state space, simply because it does not get exposed to as many data points (in comparison to, say, the policy approximation function). We utilize these observations to improve the performance of RL agents in recovering the underlying reward function and learning a good policy by exploiting the rich unlabeled trajectory data. Although works like SURF (Park et al. 2022) have proposed a semi-supervised learning approach to utilizing unlabeled trajectory data, they generate labels for the unlabeled trajectories and use these data points as if they were given by the human in the loop.
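To make the two mechanisms above concrete, the sketch below shows the standard preference-based reward learning objective (a Bradley-Terry cross-entropy over human-labeled trajectory pairs) alongside a SURF-style pseudo-labeling term that lets confident model predictions stand in for human labels on unlabeled pairs. This is a minimal illustration under stated assumptions, not the authors' or SURF's actual implementation: the class and function names, the network architecture, and the 0.95 confidence threshold are all illustrative choices.

```python
# Hedged sketch of preference-based reward learning plus a SURF-style
# pseudo-labeling term. All names (RewardNet, traj_return, the 0.95
# threshold) are assumptions for illustration, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RewardNet(nn.Module):
    """Maps a (state, action) pair to a scalar reward estimate."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)


def traj_return(reward_net, traj):
    """Sum predicted rewards over a segment; traj = (obs [T, d_o], act [T, d_a])."""
    obs, act = traj
    return reward_net(obs, act).sum()


def preference_logit(reward_net, seg0, seg1):
    """Bradley-Terry logit: P(seg1 preferred) = sigmoid(R(seg1) - R(seg0))."""
    return traj_return(reward_net, seg1) - traj_return(reward_net, seg0)


def labeled_loss(reward_net, seg0, seg1, human_label: float):
    """Cross-entropy on a human-labeled pair; label is 1.0 if seg1 is preferred."""
    logit = preference_logit(reward_net, seg0, seg1)
    return F.binary_cross_entropy_with_logits(
        logit.unsqueeze(0), torch.tensor([human_label]))


def pseudo_label_loss(reward_net, seg0, seg1, threshold: float = 0.95):
    """SURF-style term: if the current model is confident about an unlabeled
    pair, treat its own prediction as if it were a human label."""
    with torch.no_grad():
        p = torch.sigmoid(preference_logit(reward_net, seg0, seg1))
        confidence = torch.max(p, 1 - p)
    if confidence < threshold:
        return torch.tensor(0.0)  # not confident: the pair stays unused
    pseudo = float(p > 0.5)
    return labeled_loss(reward_net, seg0, seg1, pseudo)
```

Under this scheme an unlabeled pair contributes gradient signal only when the reward model is already confident about it, and the pseudo-labeled points are then treated exactly as if they came from the human in the loop, which can reinforce errors in the pseudo-labels. This is the design choice the passage contrasts with its own approach of letting unlabeled trajectories participate without being assigned surrogate human labels.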
