Reward Model Interpretability via Optimal and Pessimal Tokens
