Interpreting Language Reward Models via Contrastive Explanations

Junqi Jiang, Tom Bewley, Saumitra Mishra, Freddy Lecue, Manuela Veloso

arXiv.org Artificial Intelligence 

Reward models (RMs) are a crucial component in the alignment of large language models' (LLMs) outputs with human values. RMs approximate human preferences over possible LLM responses to the same prompt by predicting and comparing reward scores. However, as they are typically modified versions of LLMs with scalar output heads, RMs are large black boxes whose predictions are not explainable. More transparent RMs would enable improved trust in the alignment of LLMs. In this work, we propose to use contrastive explanations to explain any binary response comparison made by an RM. Specifically, we generate a diverse set of new comparisons similar to the original one to characterise the RM's local behaviour. The perturbed responses forming the new comparisons are generated to explicitly modify manually specified high-level evaluation attributes, on which analyses of RM behaviour are grounded. We then showcase the qualitative usefulness of our method for investigating the global sensitivity of RMs to each evaluation attribute, and demonstrate how representative examples can be automatically extracted to explain and compare the behaviours of different RMs. We see our method as a flexible framework for RM explanation, providing a basis for more interpretable and trustworthy LLM alignment.

The training of safe and capable large language models (LLMs) typically involves a fine-tuning step to align their outputs with human preferences. Recent work by Xu et al. (2024) suggests that fine-tuning by reinforcement learning using a language reward model (RM), which represents these preferences by rating the quality of LLM responses to user prompts, remains the state-of-the-art alignment method. In such frameworks, the effectiveness of alignment heavily depends on the quality of the RM itself (Chaudhari et al., 2024). While a growing body of research aims to improve the performance of RMs (Bai et al., 2022; Chan et al., 2024; Wang et al., 2024a), evaluating and understanding RMs has received "relatively little study" (Lambert et al., 2024).
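To make the binary comparisons that the contrastive method above perturbs more concrete, the following is a minimal sketch (not the authors' code) of how a sequence-classification-style RM with a scalar head scores two candidate responses to the same prompt. The model name, prompt, responses, and the "verbosity" attribute mentioned in the comments are illustrative assumptions, not the setup used in the paper.

```python
# Minimal sketch: scoring a binary response comparison with an off-the-shelf
# reward model from the Hugging Face Hub (example model, assumed here).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "OpenAssistant/reward-model-deberta-v3-large-v2"  # illustrative RM
tokenizer = AutoTokenizer.from_pretrained(model_name)
rm = AutoModelForSequenceClassification.from_pretrained(model_name)
rm.eval()

def reward(prompt: str, response: str) -> float:
    """Return the scalar reward the RM assigns to a (prompt, response) pair."""
    inputs = tokenizer(prompt, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return rm(**inputs).logits[0, 0].item()

prompt = "Explain photosynthesis to a ten-year-old."
response_a = "Plants use sunlight, water and air to make their own food."
response_b = "Photosynthesis is a biochemical process driven by chlorophyll."

# The RM "prefers" whichever response receives the higher score.
print(reward(prompt, response_a) > reward(prompt, response_b))

# A contrastive analysis along the lines sketched in the abstract would repeat
# this comparison on perturbed versions of the responses, each modifying one
# high-level evaluation attribute (e.g. verbosity), and inspect how the
# preference changes across the resulting set of new comparisons.
```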