Latent Adversarial Training Improves the Representation of Refusal
