Universal Adversarial Suffixes for Language Models Using Reinforcement Learning with Calibrated Reward
