Sequential Preference Ranking for Efficient Reinforcement Learning from Human Feedback
Neural Information Processing Systems
However, existing RLHF models are considered inefficient because they produce only a single preference data point from each instance of human feedback. To tackle this problem, we propose a novel RLHF framework called SeqRank, which uses sequential preference ranking to improve feedback efficiency. Our method samples trajectories sequentially by iteratively selecting a defender from the set of previously chosen trajectories \mathcal{K} and a challenger from the set of unchosen trajectories \mathcal{U}\setminus\mathcal{K}, where \mathcal{U} is the replay buffer. We propose two trajectory comparison methods that differ in how the defender is sampled: (1) sequential pairwise comparison, which selects the most recently chosen trajectory, and (2) root pairwise comparison, which selects the most preferred trajectory from \mathcal{K}. We construct a data structure that ranks trajectories by preference and use it to augment the data with additional queries.
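The two defender sampling strategies described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `collect_preferences` and `transitive_closure`, the `oracle` callback standing in for the human labeler, and the use of a transitive closure over the preference graph as the augmentation step are all assumptions made for illustration. It also assumes the labeler's preferences are consistent (transitive).

```python
def collect_preferences(trajectories, oracle, mode="root"):
    """Query the labeler once per challenger and return (winner, loser) pairs.

    trajectories: ids drawn from the replay buffer U, in sampling order.
    oracle(a, b): hypothetical stand-in for human feedback; returns the preferred id.
    mode: "sequential" uses the most recently chosen trajectory as defender,
          "root" uses the most preferred trajectory seen so far.
    """
    if mode not in ("sequential", "root"):
        raise ValueError(f"unknown mode: {mode}")
    chosen = [trajectories[0]]      # K: previously chosen trajectories
    best = trajectories[0]          # most preferred element of K (root mode)
    direct = []
    for challenger in trajectories[1:]:   # drawn from U \ K
        defender = chosen[-1] if mode == "sequential" else best
        winner = oracle(defender, challenger)
        loser = challenger if winner == defender else defender
        direct.append((winner, loser))
        chosen.append(challenger)
        if winner == challenger and defender == best:
            best = challenger       # challenger dethrones the current root
    return direct


def transitive_closure(pairs):
    """Augment direct labels: if a > b and b > c, also emit a > c."""
    better = {}
    for winner, loser in pairs:
        better.setdefault(winner, set()).add(loser)
        better.setdefault(loser, set())
    closure = set()
    for start in better:
        stack, seen = list(better[start]), set()
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(better[node])
        closure.update((start, node) for node in seen)
    return closure


# Usage: hidden scalar scores simulate a consistent labeler (an assumption).
scores = {"a": 1, "b": 3, "c": 2, "d": 5}
oracle = lambda x, y: x if scores[x] > scores[y] else y
direct = collect_preferences(list("abcd"), oracle, mode="root")
augmented = transitive_closure(direct)
print(len(direct), len(augmented))  # 3 direct labels, 5 pairs after closure
```

In this toy run, three feedback queries yield five preference pairs, because the final root "d" is inferred to be preferred over every trajectory the earlier root "b" had already beaten.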
Jan-19-2025, 16:23:56 GMT