CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models
Huijie Lv, Xiao Wang, Yuansen Zhang, Caishuang Huang, Shihan Dou, Junjie Ye, Tao Gui, Qi Zhang, Xuanjing Huang
arXiv.org Artificial Intelligence
Adversarial misuse, particularly through "jailbreaking" that circumvents a model's safety and ethical protocols, poses a significant challenge for Large Language Models (LLMs). This paper delves into the mechanisms behind such successful attacks, introducing a hypothesis for the safety mechanism of aligned LLMs: intent security recognition followed by response generation. Grounded in this hypothesis, we propose CodeChameleon, a novel jailbreak framework based on personalized encryption tactics. To elude the intent security recognition phase, we reformulate tasks into a code completion format, enabling users to encrypt queries using personalized encryption functions. To guarantee response generation functionality, we embed a decryption function within the instructions, which allows the LLM to decrypt and execute the encrypted queries successfully. We conduct extensive experiments on 7 LLMs, achieving state-of-the-art average Attack Success Rate (ASR). Remarkably, our method achieves an 86.6% ASR on GPT-4-1106.
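To make the encrypt-then-decrypt structure described in the abstract concrete, here is a minimal Python sketch of the idea: a toy word-reversal cipher stands in for a "personalized encryption function," and the matching decryption function is embedded in a code-completion style prompt. All names (encrypt_reverse, build_code_completion_prompt), the placeholder query, and the prompt wording are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: a toy word-reversal cipher as the "personalized
# encryption function," with its decryption counterpart embedded in a
# code-completion style prompt. Names and wording are assumptions.

def encrypt_reverse(query: str) -> str:
    """Encrypt a query by reversing its word order (toy personalized cipher)."""
    return " ".join(reversed(query.split()))

# Decryption function shipped inside the prompt so the model can recover the task.
DECRYPT_SNIPPET = '''
def decrypt(encrypted_query):
    # Recover the original query by reversing the word order back.
    return " ".join(reversed(encrypted_query.split()))
'''

def build_code_completion_prompt(encrypted_query: str) -> str:
    """Wrap the encrypted query and its decryption function as a code-completion task."""
    return (
        "Complete the following Python program.\n"
        f"{DECRYPT_SNIPPET}\n"
        f'encrypted_query = "{encrypted_query}"\n'
        "# Step 1: call decrypt(encrypted_query) to recover the task.\n"
        "# Step 2: write a solve() function that carries out the recovered task.\n"
    )

if __name__ == "__main__":
    # Benign placeholder query, used only to show the prompt format.
    prompt = build_code_completion_prompt(encrypt_reverse("summarize this example document"))
    print(prompt)
```

The key design point reflected here is that the query never appears in plain text in the prompt; only the code-completion scaffold and the decryption routine do, which is how the framework aims to slip past intent recognition while still letting the model reconstruct and act on the task.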
Feb-26-2024