"Short-length" Adversarial Training Helps LLMs Defend "Long-length" Jailbreak Attacks: Theoretical and Empirical Evidence

Shaopeng Fu, Liang Ding, Di Wang

arXiv.org Machine Learning 

Large language models (LLMs) (Brown et al., 2020; Touvron et al., 2023a; Liu et al., 2024a; Yang et al., 2024a) have been widely integrated into real-world applications to assist human users, but their safety mechanisms have been found vulnerable to jailbreak attacks (Wei et al., 2023). With carefully crafted adversarial prompts, one can "jailbreak" the safety mechanism of an LLM and induce arbitrary harmful behaviors (Zou et al., 2023; Chao et al., 2023; Liu et al., 2024b). To address this challenge, recent studies (Xhonneux et al., 2024; Mazeika et al., 2024; Yu et al., 2024; Casper et al., 2024) have proposed performing safety alignment via adversarial training (AT) (Madry et al., 2018) to enhance LLMs' robustness against jailbreaking. A standard AT pipeline for LLMs trains them on harmful adversarial prompts synthesized by strong jailbreak attacks, so that they learn to refuse these harmful instructions (Mazeika et al., 2024).

In such AT, the length of the synthesized adversarial prompts used for training is critical to the final jailbreak robustness of the LLM. Anil et al. (2024) and Xu et al. (2024) have shown that longer adversarial prompts possess stronger jailbreaking ability. It is therefore reasonable to expect that performing AT with longer adversarial prompts helps LLMs achieve stronger robustness against "long-length" jailbreak attacks. However, synthesizing long adversarial prompts during AT is usually time-consuming, since it requires solving discrete optimization problems in high-dimensional spaces. This may limit the application of AT to LLM safety alignment, and it raises the following research question: How does the adversarial prompt length used during AT affect the trained LLM's robustness against jailbreak attacks of different prompt lengths?

S. Fu and D. Wang are with the Division of Computer, Electrical and Mathematical Science and Engineering (CEMSE) at King Abdullah University of Science and Technology, Thuwal 23955, KSA.
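The two-level structure of AT described above (an inner attack that synthesizes an adversarial suffix of a given length, and an outer update that hardens the model against it) can be sketched with a toy example. This is not the paper's implementation: the names (`attack_score`, `synthesize_suffix`, `adversarial_training_step`), the bag-of-tokens "model", and the random search standing in for a real discrete attack such as GCG are all illustrative assumptions.

```python
import random

# Toy token vocabulary; a real attack would search over an LLM's tokenizer.
VOCAB = ["!", "@", "describing", "sure", "step"]

def attack_score(model_weights, suffix):
    # Toy surrogate for the attack objective: higher means the suffix is
    # more likely to bypass the (toy) safety mechanism.
    return sum(model_weights.get(tok, 0.0) for tok in suffix)

def synthesize_suffix(model_weights, length, n_trials=200, seed=0):
    # Random-search stand-in for a discrete jailbreak attack: sample
    # candidate suffixes of the given length and keep the best-scoring one.
    rng = random.Random(seed)
    best = max(
        (tuple(rng.choice(VOCAB) for _ in range(length)) for _ in range(n_trials)),
        key=lambda s: attack_score(model_weights, s),
    )
    return list(best)

def adversarial_training_step(model_weights, suffix_length, lr=0.1):
    # One AT iteration: (1) inner maximization synthesizes an adversarial
    # suffix; (2) outer minimization down-weights the tokens the attack
    # exploited (standing in for fine-tuning the model to refuse).
    suffix = synthesize_suffix(model_weights, suffix_length)
    for tok in suffix:
        model_weights[tok] = model_weights.get(tok, 0.0) - lr
    return model_weights, suffix

weights = {tok: 1.0 for tok in VOCAB}
for step in range(5):
    weights, suffix = adversarial_training_step(weights, suffix_length=4)
```

The `suffix_length` argument is the knob the research question is about: training against short suffixes is cheap, but it is unclear a priori how well the resulting robustness transfers to attacks with much longer suffixes.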