Improving Reward Models with Proximal Policy Exploration for Preference-Based Reinforcement Learning

Jun-13-2026, 12:32:17 GMT–Neural Information Processing Systems

Reinforcement learning (RL) depends heavily on well-designed reward functions, which are often biased and difficult to design for complex behaviors. Preference-based RL (PbRL) addresses this by learning reward models from human feedback, but its practicality is constrained by a critical dilemma: while existing methods reduce human effort through query optimization, they neglect the preference buffer's restricted coverage, a factor that fundamentally determines the reliability of the reward model. We systematically demonstrate that this limitation creates a distributional mismatch: reward models trained on static buffers reliably assess in-distribution trajectories but falter on out-of-distribution (OOD) trajectories produced by policy exploration.
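To make the coverage issue concrete, here is a minimal sketch of how a reward model is commonly fit to pairwise preferences in PbRL, using the Bradley-Terry formulation. The abstract does not state which preference model the paper uses, so this is an assumption, and every name here (`segment_return`, `preference_prob`, `preference_loss`, `toy_reward`) is hypothetical and chosen only for illustration.

```python
import math

def segment_return(reward_fn, segment):
    """Sum the predicted rewards over a segment of (state, action) pairs."""
    return sum(reward_fn(s, a) for s, a in segment)

def preference_prob(reward_fn, seg_a, seg_b):
    """Bradley-Terry probability that seg_a is preferred over seg_b."""
    diff = segment_return(reward_fn, seg_a) - segment_return(reward_fn, seg_b)
    # Numerically stable logistic function of the return difference.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)

def preference_loss(reward_fn, buffer):
    """Mean cross-entropy over (seg_a, seg_b, label) triples.

    label is 1 if the annotator preferred seg_a, 0 if seg_b.
    """
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for seg_a, seg_b, label in buffer:
        p = preference_prob(reward_fn, seg_a, seg_b)
        total -= label * math.log(p + eps) + (1 - label) * math.log(1.0 - p + eps)
    return total / len(buffer)

# Hypothetical toy reward: actions close to the state score higher.
def toy_reward(state, action):
    return -abs(state - action)

seg_good = [(0.0, 0.0), (1.0, 1.0)]  # actions match states
seg_bad = [(0.0, 2.0), (1.0, 3.0)]   # actions are off by 2
buffer = [(seg_good, seg_bad, 1)]    # annotator preferred seg_good

p = preference_prob(toy_reward, seg_good, seg_bad)
loss = preference_loss(toy_reward, buffer)
```

The loss is evaluated only on segment pairs stored in the buffer, so nothing in this objective constrains the reward model on trajectories the buffer never covered. That is the mismatch the abstract describes.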

artificial intelligence, machine learning, reinforcement learning

Technology:
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.30)