Sequential Preference Ranking for Efficient Reinforcement Learning from Human Feedback

Neural Information Processing Systems 

Existing RLHF methods are considered inefficient because each instance of human feedback yields only a single preference label. To tackle this problem, we propose SeqRank, a novel RLHF framework that uses sequential preference ranking to improve feedback efficiency. Our method samples trajectories sequentially by iteratively selecting a defender from the set of previously chosen trajectories $\mathcal{K}$ and a challenger from the set of unchosen trajectories $\mathcal{U}\setminus\mathcal{K}$, where $\mathcal{U}$ is the replay buffer. We propose two trajectory comparison methods that differ in how the defender is sampled: (1) sequential pairwise comparison, which selects the most recent trajectory in $\mathcal{K}$, and (2) root pairwise comparison, which selects the most preferred trajectory in $\mathcal{K}$. We then construct a data structure that ranks the trajectories by preference and use this ranking to augment the labeled data with additional queries.
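
To make the defender and challenger sampling concrete, the following is a minimal sketch rather than the authors' implementation. The names `collect_preferences`, `augment`, and the `prefer` callable (a stand-in for the human labeler) are hypothetical. The abstract does not specify the ranking data structure, so the sketch assumes preferences are transitive and uses a transitive closure over the labeled pairs as an illustrative substitute.

```python
import random


def collect_preferences(buffer, prefer, n_queries, mode="sequential", seed=0):
    """Run up to n_queries defender-vs-challenger comparisons.

    buffer : list of hashable trajectory ids (stands in for the replay buffer U)
    prefer : callable (a, b) -> True if a is preferred over b (hypothetical labeler)
    mode   : "sequential" -> defender is the most recently chosen trajectory
             "root"       -> defender is the most preferred trajectory in K
    Returns the labeled (winner, loser) pairs, one per query.
    """
    if mode not in ("sequential", "root"):
        raise ValueError(f"unknown mode: {mode}")
    rng = random.Random(seed)
    unchosen = list(buffer)          # U \ K
    rng.shuffle(unchosen)
    defender = unchosen.pop()        # first member of K
    labeled = []
    for _ in range(n_queries):
        if not unchosen:
            break
        challenger = unchosen.pop()  # sampled from U \ K, then moved into K
        if prefer(challenger, defender):
            winner, loser = challenger, defender
        else:
            winner, loser = defender, challenger
        labeled.append((winner, loser))
        # Sequential: the newest trajectory defends next.
        # Root: the current best defends next (best of K if preferences are transitive).
        defender = challenger if mode == "sequential" else winner
    return labeled


def augment(labeled):
    """Return all (winner, loser) pairs implied by transitivity (assumed to hold)."""
    beats = {}
    for winner, loser in labeled:
        beats.setdefault(winner, set()).add(loser)
        beats.setdefault(loser, set())
    implied = set()
    for start in beats:
        stack, seen = list(beats[start]), set()
        while stack:                 # depth-first walk over "is preferred to" edges
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            stack.extend(beats[node])
        implied.update((start, node) for node in seen)
    return implied
```

Calling `augment(collect_preferences(ids, prefer, n))` returns the directly labeled pairs together with any pairs implied by chaining them, which is one way a single round of feedback could yield more than one preference pair. How many extra pairs appear depends on the comparison outcomes and the chosen mode.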