Group Robust Preference Optimization in Reward-free RLHF

Neural Information Processing Systems 

While these data often come from diverse groups of labelers (e.g., different demographics, ethnicities, or company teams), traditional RLHF approaches