Probing Preference Representations: A Multi-Dimensional Evaluation and Analysis Method for Reward Models
