Robust LLM safeguarding via refusal feature adversarial training
