Meta-Reward-Net: Implicitly Differentiable Reward Learning for Preference-based Reinforcement Learning
Neural Information Processing Systems
Designing a well-shaped reward function remains a challenge for many reinforcement learning applications. Preference-based reinforcement learning (PbRL) provides a framework that avoids reward engineering by leveraging human preferences (e.g., preferring apples over oranges) as the reward signal. Improving the efficiency of preference-data usage is therefore critical. In this work, we propose Meta-Reward-Net (MRN), a data-efficient PbRL framework that incorporates bi-level optimization for both reward and policy learning. The key idea of MRN is to adopt the performance of the Q-function as the learning target.
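To make the bi-level idea concrete, here is a minimal, purely illustrative sketch of a one-step bi-level update: an inner gradient step updates a "policy" parameter under the current "reward" parameter, and an outer step updates the reward parameter by differentiating through that inner step. The quadratic losses, learning rates, and function names (`inner_step`, `outer_grad`, `bilevel_optimize`) are assumptions for the toy example, not MRN's actual objectives or implementation.

```python
# Toy bi-level optimization sketch (assumed setup, not the paper's code).
# Inner loss:  L_in(theta, phi)  = 0.5 * (theta - phi)^2   (policy step)
# Outer loss:  L_out(theta_new)  = 0.5 * (theta_new - target)^2

def inner_step(theta, phi, lr=0.5):
    # One gradient step on the inner loss with respect to theta.
    grad_theta = theta - phi
    return theta - lr * grad_theta

def outer_grad(theta_new, target, lr=0.5):
    # theta_new depends on phi through the inner step:
    # d(theta_new)/d(phi) = lr, so the chain rule gives:
    return (theta_new - target) * lr

def bilevel_optimize(target=3.0, steps=200, outer_lr=0.5):
    theta, phi = 0.0, 0.0
    for _ in range(steps):
        theta = inner_step(theta, phi)                # inner (policy) update
        phi -= outer_lr * outer_grad(theta, target)   # outer (reward) update
    return theta, phi

theta, phi = bilevel_optimize()
# Both parameters converge toward the outer target.
```

The outer update never optimizes its loss with respect to the inner parameter directly; it only sees the inner parameter through the inner update rule, which is the defining structure of bi-level (meta-gradient) learning.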
implicitly differentiable reward learning, preference-based reinforcement learning