Rule Based Rewards for Language Model Safety
Neural Information Processing Systems
We propose a novel preference modeling approach that uses AI feedback and requires only a small amount of human data.
Oct-10-2025, 16:03:57 GMT