Attack via Overfitting: 10-shot Benign Fine-tuning to Jailbreak LLMs

Jun-11-2026, 09:50:37 GMT–Neural Information Processing Systems

Despite substantial efforts in safety alignment, recent research indicates that Large Language Models (LLMs) remain highly susceptible to jailbreak attacks.

artificial intelligence, large language model, natural language

Genre:
- Research Report > New Finding (0.39)

Technology:
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.84)