Interpretable Reward Model via Sparse Autoencoder

Open in new window