Rule Based Rewards for Language Model Safety
