Attack via Overfitting: 10-shot Benign Fine-tuning to Jailbreak LLMs
Neural Information Processing Systems
Despite substantial efforts in safety alignment, recent research indicates that Large Language Models (LLMs) remain highly susceptible to jailbreak attacks.
Jun-11-2026, 09:50:37 GMT