Preference Poisoning Attacks on Reward Model Learning
