A Unified Linear Programming Framework for Offline Reward Learning from Human Demonstrations and Feedback

Kihyun Kim, Jiawei Zhang, Asuman Ozdaglar, Pablo A. Parrilo

arXiv.org Machine Learning 

Reward learning involves inferring and shaping the underlying reward function from observed human demonstrations and feedback. Inverse reinforcement learning (IRL) and reinforcement learning from human feedback (RLHF, also known as preference-based reinforcement learning) are key methodologies in reward learning, applied in various sequential decision-making tasks such as games [1-3], robotics [4-6], and language models [7-11]. In particular, amid the recent rapid development of large language models (LLMs), RLHF has played a crucial role in fine-tuning models to better align with human preferences [10]. However, despite the notable empirical success of these algorithms, a significant gap remains in the theoretical analysis of IRL and RLHF, limiting our ability to guarantee their reliability. This work aims to bridge this gap by proposing a novel theoretical framework for offline IRL and RLHF.

IRL aims to infer a reward function that is consistent with expert behavior observed in demonstrations [12, 13]. Typical IRL algorithms employ a bi-level optimization framework within the context of maximum likelihood estimation (MLE): the inner optimization evaluates the policy under the current reward parameters, while the outer optimization updates these parameters to better match the observed expert behavior. Such algorithms have been extensively explored in the literature [14-18], and their convergence has been studied in both online [17] and offline [18] settings.
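To make the bi-level structure concrete, the following is a schematic sketch of the MLE-based IRL objective; the notation (expert dataset $\mathcal{D}^E$, parameterized reward $r_\theta$, discount factor $\gamma$, and entropy-regularized optimal policy $\pi_{r_\theta}$) is chosen here for illustration and is not the notation fixed by this paper.

\[
\max_{\theta} \;\; \sum_{(s,a) \in \mathcal{D}^E} \log \pi_{r_\theta}(a \mid s)
\quad \text{subject to} \quad
\pi_{r_\theta} \in \arg\max_{\pi} \; \mathbb{E}_{\pi}\!\Big[ \sum_{t \ge 0} \gamma^{t} \big( r_\theta(s_t, a_t) + \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \big) \Big],
\]

where the inner problem computes the (entropy-regularized) optimal policy under the current reward parameters, and the outer problem updates $\theta$ so that this policy assigns higher likelihood to the expert's observed state-action pairs.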
