Rule Based Rewards for Language Model Safety
Neural Information Processing Systems
We propose a novel preference modeling approach that uses AI feedback and requires only a small amount of human data.
Feb-18-2026, 01:00:29 GMT