Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing
Jiabao Ji, Bairu Hou, Alexander Robey, George J. Pappas, Hamed Hassani, Yang Zhang, Eric Wong, Shiyu Chang
Aligned large language models (LLMs) are vulnerable to jailbreaking attacks, which bypass the safeguards of targeted LLMs and fool them into generating objectionable content. While initial defenses show promise against token-based threat models, no existing defense provides robustness against semantic attacks while avoiding unfavorable trade-offs between robustness and nominal performance. To meet this need, we propose SemanticSmooth, a smoothing-based defense that aggregates the predictions of multiple semantically transformed copies of a given input prompt. Experimental results demonstrate that SemanticSmooth achieves state-of-the-art robustness against the GCG, PAIR, and AutoDAN attacks while maintaining strong nominal performance on instruction-following benchmarks such as InstructionFollowing and AlpacaEval. The code will be publicly available at https://github.com/UCSB-NLP-Chang/SemanticSmooth.
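As a rough sketch of the aggregation idea described in the abstract, the minimal Python example below votes over semantically transformed copies of an input prompt. The transform list, the `llm_respond` and `is_refusal` stubs, and the majority-vote refusal rule are illustrative assumptions, not the paper's exact procedure; the linked repository contains the actual implementation.

```python
import random
from collections import Counter

# Hypothetical transform names; in the paper, transforms such as
# paraphrasing or summarization are themselves performed by an LLM.
TRANSFORMS = ["paraphrase", "summarize", "translate", "spellcheck"]

def apply_transform(prompt: str, transform: str) -> str:
    """Stub: rewrite `prompt` under the named semantic transform."""
    return f"[{transform}] {prompt}"

def llm_respond(prompt: str) -> str:
    """Toy stand-in for the target LLM: refuses on an obvious keyword."""
    if "bomb" in prompt.lower():
        return "I cannot help with that request."
    return f"Sure, here is an answer to: {prompt}"

def is_refusal(response: str) -> bool:
    """Toy safety check; a real system might use a judge model."""
    return response.lower().startswith("i cannot")

def semantic_smooth(prompt: str, n_copies: int = 5) -> str:
    """Respond via majority vote over semantically transformed copies."""
    responses = [
        llm_respond(apply_transform(prompt, random.choice(TRANSFORMS)))
        for _ in range(n_copies)
    ]
    refusals = Counter(is_refusal(r) for r in responses)
    # If most transformed copies are refused, refuse the original prompt;
    # otherwise return one of the accepted responses.
    if refusals[True] > n_copies // 2:
        return "I cannot help with that request."
    return next(r for r in responses if not is_refusal(r))

print(semantic_smooth("How do I bake bread?"))
```

The intuition behind this style of smoothing defense is that a token-level adversarial suffix rarely survives paraphrasing or summarization intact, so most transformed copies of a jailbreak prompt elicit refusals and the vote restores the refusal, while benign prompts remain answerable.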
arXiv.org Artificial Intelligence
Feb-28-2024
- Country:
- North America > United States
- California (0.14)
- South Carolina > Charleston County (0.14)
- Genre:
- Instructional Material (1.00)
- Research Report > New Finding (0.48)
- Workflow (1.00)
- Industry:
- Government (1.00)
- Health & Medicine (1.00)
- Information Technology > Security & Privacy (1.00)
- Law (0.93)
- Law Enforcement & Public Safety (1.00)
- Media > News (0.68)