Learning Goal-Conditioned Representations for Language Reward Models
Neural Information Processing Systems
Techniques that learn improved representations via offline data or self-supervised objectives have shown impressive results in traditional reinforcement learning. Nevertheless, it is unclear how improved representation learning can benefit reinforcement learning from human feedback on language models. In this work, we propose training reward models (RMs) in a contrastive, goal-conditioned fashion by increasing the representation similarity of future states along sampled preferred trajectories and decreasing the similarity along randomly sampled dispreferred trajectories. This objective significantly improves reward model performance, by up to 0.09 AUROC, across challenging benchmarks such as MATH and GSM8k. These findings extend to general alignment as well: on the Helpful-Harmless dataset, we observe a 2.3% increase in accuracy. Beyond improving reward model performance, we show that this way of training RM representations enables improved steerability because it allows us to evaluate the likelihood of an action achieving a particular goal-state (e.g.
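The contrastive idea in the abstract can be illustrated with a small sketch. The abstract does not give the exact loss, so everything below is an assumption made for illustration: the function names, the use of cosine similarity, and the log-sigmoid form of the objective are hypothetical and are not taken from the paper. The sketch only shows the general pattern of pulling a state representation toward a future state from a preferred trajectory and pushing it away from a state drawn from a dispreferred one.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def contrastive_goal_loss(state, preferred_future, dispreferred_state):
    """Hypothetical contrastive loss (an assumption, not the paper's formula).

    Lower when `state` is similar to a future state on a preferred
    trajectory and dissimilar to a state from a dispreferred trajectory.
    """
    pos = cosine(state, preferred_future)    # similarity to encourage
    neg = cosine(state, dispreferred_state)  # similarity to discourage
    return -(log_sigmoid(pos) + log_sigmoid(-neg))

# Toy 2-d "representations" (made-up values for illustration only).
state = [1.0, 0.0]
aligned = contrastive_goal_loss(state, [1.0, 0.0], [-1.0, 0.0])
misaligned = contrastive_goal_loss(state, [-1.0, 0.0], [1.0, 0.0])
```

In this toy setup the loss is lower when the state already points toward the preferred future state than when it points toward the dispreferred one, which is the direction a gradient step on such an objective would move the representations.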
Jun-2-2025, 00:28:44 GMT