Generalizing Reward Modeling for Out-of-Distribution Preference Learning
