Learning Goal-Conditioned Representations for Language Reward Models

Jun-2-2025, 00:28:44 GMT–Neural Information Processing Systems

Techniques that learn improved representations via offline data or self-supervised objectives have shown impressive results in traditional reinforcement learning. Nevertheless, it is unclear how improved representation learning can benefit reinforcement learning from human feedback on language models. In this work, we propose training reward models (RMs) in a contrastive, goal-conditioned fashion by increasing the representation similarity of future states along sampled preferred trajectories and decreasing the similarity along randomly sampled dispreferred trajectories. This objective significantly improves reward model performance by up to 0.09 AUROC across challenging benchmarks, such as MATH and GSM8k. These findings extend to general alignment as well: on the Helpful-Harmless dataset, we observe a 2.3% increase in accuracy. Beyond improving reward model performance, we show this way of training RM representations enables improved steerability because it allows us to evaluate the likelihood of an action achieving a particular goal-state (e.g.
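The contrastive, goal-conditioned objective described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of cosine similarity, and the log-sigmoid form of the loss are all assumptions chosen for illustration. The sketch only shows the general idea of pulling a state's representation toward a future state on a preferred trajectory and pushing it away from a state on a dispreferred trajectory.

```python
import numpy as np

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)) = -log(1 + exp(-x)).
    return -np.logaddexp(0.0, -x)

def cosine_similarity(a, b):
    # Row-wise cosine similarity between two (n, d) arrays.
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return np.sum(a * b, axis=-1)

def goal_conditioned_contrastive_loss(state, preferred_future, dispreferred_future):
    # Hypothetical loss (assumed form, not taken from the paper):
    #   state               : (n, d) hidden states sampled from preferred trajectories
    #   preferred_future    : (n, d) later hidden states from the same preferred trajectories
    #   dispreferred_future : (n, d) hidden states from randomly sampled dispreferred trajectories
    pos = cosine_similarity(state, preferred_future)
    neg = cosine_similarity(state, dispreferred_future)
    # Reward high similarity to preferred futures, penalize similarity to dispreferred ones.
    return float(np.mean(-log_sigmoid(pos) - log_sigmoid(-neg)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    s = rng.normal(size=(4, 8))
    # A representation aligned with its preferred future should score a lower loss
    # than one aligned with the dispreferred future.
    print(goal_conditioned_contrastive_loss(s, s, -s))
    print(goal_conditioned_contrastive_loss(s, -s, s))
```

In practice such a term would presumably be combined with a standard preference-ranking loss on the reward model; how the two are weighted is not stated in the abstract and is left out here.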

language reward model, learning goal-conditioned representation, reward model performance

Technology:
- Information Technology > Artificial Intelligence > Machine Learning
  - Reinforcement Learning (0.51)
  - Neural Networks > Deep Learning (0.40)