Adaptive Preference Scaling for Reinforcement Learning with Human Feedback

May-27-2025, 15:23:59 GMT–Neural Information Processing Systems

Reinforcement learning from human feedback (RLHF) is a prevalent approach to aligning AI systems with human values by learning rewards from human preference data. Such data, however, typically takes the form of rankings over pairs of trajectory segments, which fail to capture how strongly one segment is preferred over the other. In this paper, we propose a novel adaptive preference loss, underpinned by distributionally robust optimization (DRO), designed to address this uncertainty in preference strength. By incorporating an adaptive scaling parameter into the loss for each pair, our method increases the flexibility of the reward function. Specifically, it assigns small scaling parameters to pairs with ambiguous preferences, leading to more comparable rewards, and large scaling parameters to pairs with clear preferences, leading to more distinct rewards.
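The idea of a per-pair scaling parameter can be illustrated with a small sketch. The abstract does not state the loss itself, so everything below is an assumption made for illustration: the objective tau * softplus(-gap / tau) - penalty * tau, the names `scaled_pair_loss` and `best_scale`, the bounds `tau_min` and `tau_max`, the `penalty` value, and the use of a grid search. It should not be read as the loss proposed in the paper.

```python
import math


def scaled_pair_loss(reward_gap, tau, penalty):
    """Hypothetical per-pair loss: tau * softplus(-gap / tau) - penalty * tau.

    reward_gap is r(preferred) - r(rejected); tau > 0 is the pair's scale.
    This form is an illustrative assumption, not the paper's objective.
    """
    z = -reward_gap / tau
    # Numerically stable softplus: log(1 + exp(z)).
    softplus = max(z, 0.0) + math.log1p(math.exp(-abs(z)))
    return tau * softplus - penalty * tau


def best_scale(reward_gap, tau_min=0.1, tau_max=5.0, penalty=0.3, steps=200):
    """Grid-search the scale in [tau_min, tau_max] that minimizes the loss."""
    grid = [tau_min + (tau_max - tau_min) * i / (steps - 1) for i in range(steps)]
    return min(grid, key=lambda t: scaled_pair_loss(reward_gap, t, penalty))


if __name__ == "__main__":
    # Compare the selected scale for an ambiguous pair and a clear pair.
    for gap in (0.0, 0.5, 3.0):
        print(f"reward gap {gap:.1f} -> selected scale {best_scale(gap):.3f}")
```

Under these assumed settings, a pair whose reward gap is near zero selects the smallest allowed scale, while a pair with a large gap selects a larger one, which mirrors the qualitative behavior described above. In a full training loop the scale would be optimized jointly with the reward model rather than searched on a grid.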

adaptive preference scaling, optimization, reinforcement learning, (3 more...)

Technology:
- Information Technology > Artificial Intelligence > Machine Learning
  - Reinforcement Learning (0.64)
  - Neural Networks > Deep Learning (0.62)