Endless Jailbreaks with Bijection Learning

Brian R. Y. Huang, Maximilian Li, Leonard Tang

arXiv.org Artificial Intelligence 

Despite extensive safety measures, LLMs are vulnerable to adversarial inputs, or jailbreaks, which can elicit unsafe behaviors. In this work, we introduce bijection learning, a powerful attack algorithm that automatically fuzzes LLMs for safety vulnerabilities using randomly generated encodings whose complexity can be tightly controlled. We leverage in-context learning to teach models bijective encodings, pass encoded queries to the model to bypass built-in safety mechanisms, and finally decode responses back into English. Our attack is extremely effective on a wide range of frontier language models. Moreover, by controlling complexity parameters such as the number of key-value mappings in the encodings, we find a close relationship between the capability level of the attacked LLM and the average complexity of the most effective bijection attacks. Our work highlights that new vulnerabilities in frontier models can emerge with scale: more capable models are more severely jailbroken by bijection attacks.

Figure 1: An overview of the bijection learning attack, which uses in-context learning and bijective mappings with complexity parameters to optimally jailbreak LLMs of different capability levels.

Large language models (LLMs) have been the subject of concerns about safety and potential misuse. As systems like Claude and ChatGPT see widespread deployment, their strong world knowledge and reasoning capabilities may empower bad actors or amplify negative side effects of downstream usage. Model designers have therefore worked prudently to safeguard models against harmful use: they have trained LLMs to refuse harmful inputs with RLHF (Christiano et al., 2017; Bai et al., 2022) and adversarial training (Ziegler et al., 2022), and have guarded deployed LLM systems with additional system-level safeguards. However, prior work has shown that safeguards on language models can be circumvented through adversarial input schemes. One topic of debate is how models' vulnerability to jailbreaks changes with scale. Some works (Ren et al., 2024; Howe et al., 2024) argue that scaling model capabilities can improve performance on safety benchmarks, even if defense mechanisms do not meaningfully improve.
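
To make the encoding step concrete, the following is a minimal sketch of how a complexity-controlled bijective encoding could be generated and applied. This is an illustrative assumption, not the authors' released implementation: here complexity is parameterized by the number of letters that are remapped rather than left as fixed points, matching the abstract's "number of key-value mappings" knob in spirit, and all function names are hypothetical.

```python
import random
import string

def make_bijection(n_remapped: int, seed: int | None = None) -> dict[str, str]:
    """Build a random bijection on the lowercase alphabet.

    n_remapped controls attack complexity: that many letters are shuffled
    among themselves, and the remaining letters stay as fixed points
    (mapped to themselves). A shuffle can still leave a chosen letter
    fixed by chance; a stricter version would enforce a derangement on
    the remapped subset.
    """
    rng = random.Random(seed)
    letters = list(string.ascii_lowercase)
    moved = rng.sample(letters, n_remapped)    # letters chosen for remapping
    targets = moved[:]
    rng.shuffle(targets)
    mapping = {c: c for c in letters}          # start from the identity map
    mapping.update(dict(zip(moved, targets)))  # remap the chosen subset
    return mapping

def encode(text: str, mapping: dict[str, str]) -> str:
    """Apply the bijection character by character; non-letters pass through."""
    return "".join(mapping.get(c, c) for c in text.lower())

def decode(text: str, mapping: dict[str, str]) -> str:
    """Invert the bijection to recover English from an encoded response."""
    inverse = {v: k for k, v in mapping.items()}
    return "".join(inverse.get(c, c) for c in text)

# Round-trip check: encoding then decoding recovers the original string.
bij = make_bijection(n_remapped=10, seed=0)
assert decode(encode("hello world", bij), bij) == "hello world"
```

Because the mapping is a bijection, decoding is always exact, and because a fresh mapping can be sampled on every attempt, the attack can be re-rolled endlessly against the same target.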
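
The in-context teaching and query steps can be sketched in the same spirit. The chat-message schema below is the generic role/content format used by most LLM APIs, and the wording of the teaching prompt and the few-shot sentences are invented for illustration; the paper's actual prompts may differ.

```python
# Benign sentences used to demonstrate the encoding in context.
FEWSHOT_SENTENCES = [
    "the quick brown fox jumps over the lazy dog",
    "language models learn statistical patterns from text",
]

def build_attack_messages(query: str, mapping: dict[str, str]) -> list[dict]:
    """Assemble a chat transcript that (1) states the encoding rules,
    (2) demonstrates them with few-shot translation pairs, and
    (3) delivers the query only in encoded form.

    Reuses encode() from the sketch above.
    """
    rules = "\n".join(f"{k} -> {v}" for k, v in mapping.items() if k != v)
    system = (
        "You speak a private language defined by this character mapping "
        "(all unlisted characters are unchanged):\n" + rules +
        "\nAlways reply in the private language."
    )
    messages = [{"role": "system", "content": system}]
    for s in FEWSHOT_SENTENCES:  # teach the bijection via worked examples
        messages.append({"role": "user", "content": "Translate: " + s})
        messages.append({"role": "assistant", "content": encode(s, mapping)})
    messages.append({"role": "user", "content": encode(query, mapping)})
    return messages

# After sending `messages` to a target model, the attacker applies
# decode(response_text, mapping) to read the reply back in English.
```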