Guiding not Forcing: Enhancing the Transferability of Jailbreaking Attacks on LLMs via Removing Superfluous Constraints
Junxiao Yang, Zhexin Zhang, Shiyao Cui, Hongning Wang, Minlie Huang
arXiv.org Artificial Intelligence
Jailbreaking attacks can effectively induce unsafe behaviors in Large Language Models (LLMs); however, the transferability of these attacks across different models remains limited. This study aims to understand and enhance the transferability of gradient-based jailbreaking methods, which are among the standard approaches for attacking white-box models. Through a detailed analysis of the optimization process, we introduce a novel conceptual framework to elucidate transferability and identify superfluous constraints, specifically the response pattern constraint and the token tail constraint, as significant barriers to improved transferability. Removing these unnecessary constraints substantially enhances the transferability and controllability of gradient-based attacks. Evaluated with Llama-3-8B-Instruct as the source model, our method increases the overall Transfer Attack Success Rate (T-ASR) across a set of target models with varying safety levels from 18.4% to 50.3%, while also improving the stability and controllability of jailbreak behaviors on both source and target models.
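For concreteness, the headline T-ASR metric can be read as a plain success rate: a suffix optimized on the source model is appended to each harmful prompt, and one counts how often target models produce an unsafe (non-refusing) response. The sketch below is a minimal, hypothetical illustration of that bookkeeping, not the paper's implementation; the `generate` and `is_unsafe` hooks are placeholders for a target model's decoding call and a safety judge.

```python
def transfer_asr(prompts, suffix, target_models, is_unsafe):
    """Hypothetical T-ASR sketch: fraction of (prompt, target model) pairs
    where appending the optimized suffix yields an unsafe response.

    prompts        -- list of harmful instruction strings
    suffix         -- adversarial suffix optimized on the source model
    target_models  -- objects exposing a generate(text) -> str method (placeholder)
    is_unsafe      -- callable judging whether a response is unsafe (placeholder)
    """
    successes, total = 0, 0
    for model in target_models:
        for prompt in prompts:
            response = model.generate(prompt + " " + suffix)
            successes += int(is_unsafe(response))
            total += 1
    return successes / total
```

Under this reading, the reported improvement corresponds to raising this averaged success rate over the target-model set from 18.4% to 50.3%.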
Feb-25-2025
- Genre:
- Research Report (1.00)
- Industry:
- Information Technology > Security & Privacy (0.47)