Interpreting Language Reward Models via Contrastive Explanations
