Playing Language Game with LLMs Leads to Jailbreaking
Yu Peng, Zewen Long, Fangming Dong, Congyi Li, Shu Wu, Kai Chen
–arXiv.org Artificial Intelligence
The advent of large language models (LLMs) has spurred the development of numerous jailbreak techniques aimed at circumventing their security defenses against malicious attacks. An effective jailbreak approach is to identify a domain where safety generalization fails, a phenomenon known as mismatched generalization. In this paper, we introduce two novel jailbreak methods based on mismatched generalization: natural language games and custom language games. Both effectively bypass the safety mechanisms of LLMs, and both admit many kinds and variants, which makes them difficult to defend against and yields high attack success rates. Natural language games rely on synthetic linguistic constructs and the actions intertwined with those constructs, such as the Ubbi Dubbi language. Building on this phenomenon, we propose the custom language games method: by engaging with LLMs through a variety of custom rules, we successfully execute jailbreak attacks across multiple LLM platforms. Extensive experiments demonstrate the effectiveness of our methods, achieving success rates of 93% on GPT-4o, 89% on GPT-4o-mini, and 83% on Claude-3.5-Sonnet. Furthermore, to investigate the generalizability of safety alignment, we fine-tuned Llama-3.1-70B on the custom language games so that it achieved safety alignment on our datasets, and found that when interacting through other language games, the fine-tuned model still failed to identify harmful content. This finding indicates that the safety alignment knowledge embedded in LLMs fails to generalize across different linguistic formats, opening new avenues for future research in this area. Our code is available at https://anonymous.4open.science/r/encode

Warning: this paper contains examples with unsafe content.

Large language models (LLMs) such as ChatGPT (Achiam et al., 2023), Llama2 (Touvron et al., 2023), Claude2 (Anthropic, 2023) and Gemini (Team et al., 2023) have become increasingly important across various domains due to their advanced natural language comprehension and generation capabilities. These models are employed in a wide range of applications, including customer service, content generation, code assistance, and even medical diagnostics, offering valuable suggestions and improving productivity in numerous scenarios. However, with this growing prominence comes a heightened risk: the rapid development of attack schemes designed to manipulate or deceive these models into generating unsafe or unethical content.
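Both methods hinge on transforming a prompt through a rule the model is first taught in context. As a rough illustration only (this is not the authors' code, and the rules are chosen purely for demonstration), the sketch below approximates Ubbi Dubbi encoding at the text level and applies a toy user-defined word-substitution rule of the kind a custom language game might use.

```python
import re

def ubbi_dubbi_encode(text: str) -> str:
    """Rough text-level approximation of Ubbi Dubbi:
    insert 'ub' before each vowel group, e.g. 'hello' -> 'hubellubo'."""
    return re.sub(r"([aeiouAEIOU]+)", r"ub\1", text)

# Hypothetical custom rule table: arbitrary codewords the user would first
# teach the model via in-context instructions (illustrative values only).
CUSTOM_RULES = {"apple": "alpha", "river": "rho"}

def custom_game_encode(text: str) -> str:
    """Apply a user-defined word-substitution 'language game' to a prompt."""
    for word, codeword in CUSTOM_RULES.items():
        text = re.sub(rf"\b{word}\b", codeword, text, flags=re.IGNORECASE)
    return text

if __name__ == "__main__":
    prompt = "Describe the apple orchard by the river."
    print(ubbi_dubbi_encode(prompt))
    print(custom_game_encode(prompt))
```

In an attack setting, the transformed prompt would be sent together with instructions explaining the game's rules, and the response would be decoded by inverting the same rules.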
Nov-27-2024
- Genre:
- Research Report > New Finding (0.93)
- Industry:
- Information Technology > Security & Privacy (1.00)