Rule Based Rewards for Language Model Safety
Neural Information Processing Systems
We propose a novel preference modeling approach that uses AI feedback and requires only a small amount of human data.
Nov-20-2025, 13:37:38 GMT