Sequential Preference Ranking for Efficient Reinforcement Learning from Human Feedback

Jan-19-2025, 16:23:56 GMT–Neural Information Processing Systems

Existing RLHF models are considered inefficient because each human feedback query yields only a single preference. To tackle this problem, we propose a novel RLHF framework called SeqRank, which uses sequential preference ranking to improve feedback efficiency. Our method samples trajectories sequentially by iteratively selecting a defender from the set of previously chosen trajectories $\mathcal{K}$ and a challenger from the set of unchosen trajectories $\mathcal{U}\setminus\mathcal{K}$, where $\mathcal{U}$ is the replay buffer. We propose two trajectory comparison methods that differ in how the defender is sampled: (1) sequential pairwise comparison, which selects the most recently chosen trajectory, and (2) root pairwise comparison, which selects the most preferred trajectory from $\mathcal{K}$. We construct a data structure that ranks trajectories by preference and use it to augment the queries with additional preference data.
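The defender/challenger loop described above can be illustrated with a minimal sketch. It is not the authors' implementation. The function names `collect_preferences` and `transitive_closure`, the `prefers` oracle standing in for the human labeler, and the use of integer ids for trajectories are all hypothetical. The sketch also assumes preferences are transitive, so that extra (winner, loser) pairs can be derived from the queried ones; the abstract does not specify the actual data structure used for augmentation.

```python
import random
from itertools import product


def transitive_closure(pairs):
    """Derive extra (winner, loser) pairs, ASSUMING preferences are transitive."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        # Iterate over a snapshot so the set can grow inside the loop.
        for (a, b), (c, d) in product(list(closure), repeat=2):
            if b == c and a != d and (a, d) not in closure:
                closure.add((a, d))
                changed = True
    return closure


def collect_preferences(buffer, prefers, num_queries, mode="sequential", seed=0):
    """Query a (hypothetical) preference oracle with defender/challenger pairs.

    buffer:  trajectory ids, standing in for the replay buffer U
    prefers: prefers(a, b) -> True if a is preferred over b (assumed oracle)
    mode:    "sequential" -> defender is the most recently chosen trajectory
             "root"       -> defender is the most preferred trajectory in K
    Returns (queried_pairs, augmented_pairs), each pair being (winner, loser).
    """
    rng = random.Random(seed)
    unchosen = list(buffer)          # U \ K
    rng.shuffle(unchosen)
    first = unchosen.pop()
    chosen = [first]                 # K, in order of selection
    best = first                     # most preferred trajectory seen so far
    queried = []
    for _ in range(num_queries):
        if not unchosen:
            break
        defender = chosen[-1] if mode == "sequential" else best
        challenger = unchosen.pop()
        if prefers(challenger, defender):
            winner, loser = challenger, defender
        else:
            winner, loser = defender, challenger
        queried.append((winner, loser))
        chosen.append(challenger)
        if mode == "root" and winner == challenger:
            best = challenger        # challenger becomes the new root
    return queried, transitive_closure(queried)
```

With a toy oracle such as `lambda a, b: a > b`, each call issues one query per challenger, and the augmented set contains the queried pairs plus any pairs implied by chaining them. How many extra pairs appear depends on the sampling order and on the mode.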

efficient reinforcement learning, pairwise comparison, sequential preference ranking

Technology:
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)