Understanding and Enhancing the Transferability of Jailbreaking Attacks
Lin, Runqi, Han, Bo, Li, Fengwang, Liu, Tongling
–arXiv.org Artificial Intelligence
Content Warning: This paper contains examples of harmful language.
Jailbreaking attacks can effectively manipulate open-source large language models (LLMs) to produce harmful responses. However, these attacks exhibit limited transferability, failing to disrupt proprietary LLMs consistently. To reliably identify vulnerabilities in proprietary LLMs, this work investigates the transferability of jailbreaking attacks by analysing their impact on the model's intent perception. On the source LLM, the adversarial sequences that these attacks introduce divert the model's focus away from malicious-intent tokens. Nevertheless, these adversarial sequences fail to mislead the target LLM's intent perception, allowing the target LLM to refocus on malicious-intent tokens and abstain from responding. Our analysis further reveals the inherent distributional dependency within the generated adversarial sequences, whose effectiveness stems from overfitting the source LLM's parameters, resulting in limited transferability to target LLMs. To this end, we propose the Perceived-importance Flatten (PiF) method, which uniformly disperses the model's focus across neutral-intent tokens in the original input, thus obscuring malicious-intent tokens without relying on overfitted adversarial sequences. Extensive experiments demonstrate that PiF provides an effective and efficient red-teaming evaluation for proprietary LLMs.
Empowered by massive corpora, large language models (LLMs) have achieved human-level conversational capabilities (OpenAI, 2023a; Google, 2023; Meta, 2024) and are widely employed in real-world applications. However, their training corpus is mainly crawled from the Internet without thorough ethical review, raising concerns about the potential risks associated with LLMs. Recent red-teaming efforts highlight that jailbreaking attacks can effectively induce LLMs to produce undesirable content with harmful consequences (Perez et al., 2022; Ganguli et al., 2022; Ouyang et al., 2022).
Unlike model-level jailbreaks that necessitate parameter modifications and are restricted to open-source LLMs (Qi et al., 2024; Huang et al., 2023a), token-level and prompt-level jailbreaks can generate transferable adversarial sequences (Yu et al., 2023; Lapid et al., 2023), thus posing a potential threat to widely used proprietary LLMs (Zou et al., 2023; Chao et al., 2023). Nevertheless, empirical results indicate that these adversarial sequences lack reliable transferability, failing to consistently manipulate target LLMs (Chao et al., 2024; Chen et al., 2024). Furthermore, these lengthy adversarial sequences can be countered by adaptive jailbreaking detection and defence (Alon & Kamfonas, 2023; Inan et al., 2023; Robey et al., 2023; Wang et al., 2024a). As depicted in Figure 1, developing jailbreak attacks that can reliably identify vulnerabilities in proprietary LLMs, thereby promoting human alignment and preventing future misuse, remains a significant challenge. The attacks shown in Figure 1 are initially generated on the source LLM (Llama-2-7B-Chat) and subsequently transferred to the target LLM (Llama-2-13B-Chat).
Feb-5-2025
- Country:
- North America > United States (0.46)
- Oceania > Australia (0.14)
- Africa (0.04)
- South America > Chile
- Asia > China
- Hong Kong (0.04)
- Genre:
- Research Report > New Finding (0.46)
- Industry:
- Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
- Information Technology > Security & Privacy (1.00)
- Media > News (0.67)
- Law > Criminal Law (0.67)
- Health & Medicine (0.67)
- Government > Regional Government (0.67)
- Technology: