Attack via Overfitting: 10-shot Benign Fine-tuning to Jailbreak LLMs

Neural Information Processing Systems

Despite substantial efforts in safety alignment, recent research indicates that Large Language Models (LLMs) remain highly susceptible to jailbreak attacks.