Fight Back Against Jailbreaking via Prompt Adversarial Tuning

Mar-21-2026, 04:13:13 GMT–Neural Information Processing Systems

While Large Language Models (LLMs) have achieved tremendous success in various applications, they are also susceptible to jailbreaking attacks. Several primary defense strategies have been proposed to protect LLMs from producing harmful information, mostly focusing on model fine-tuning or heuristic defense designs. However, how to achieve intrinsic robustness through prompt optimization remains an open problem.

large language model, natural language, proceedings

Conferences Web Page

Technology:
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.87)