Open Sesame! Universal Black Box Jailbreaking of Large Language Models

Raz Lapid, Ron Langberg, Moshe Sipper

arXiv.org Artificial Intelligence 

Large language models (LLMs), designed to provide helpful and safe responses, often rely on alignment techniques to align with user intent and social guidelines. Unfortunately, this alignment can be exploited by malicious actors seeking to manipulate an LLM's outputs for unintended purposes. In this paper we introduce a novel approach that employs a genetic algorithm (GA) to manipulate LLMs when model architecture and parameters are inaccessible. The GA attack works by optimizing a universal adversarial prompt that, when combined with a user's query, disrupts the attacked model's alignment, resulting in unintended and potentially harmful outputs. Our approach systematically reveals a model's limitations and vulnerabilities by uncovering instances where its responses deviate from expected behavior. Through extensive experiments we demonstrate the efficacy of our technique, thus contributing to the ongoing discussion on responsible AI development by providing a diagnostic tool for evaluating and enhancing the alignment of LLMs with human intent. To our knowledge, this is the first automated universal black-box jailbreak attack.

Large language models (LLMs) are generally trained on extensive text datasets gathered from the internet, which have been shown to contain a considerable volume of objectionable material. As a result, contemporary LLM developers have adopted the practice of "aligning" such models (Wang et al., 2023) through a variety of fine-tuning mechanisms. Various techniques are employed for this purpose (Ouyang et al., 2022; Glaese et al., 2022; Bai et al., 2022), with the overall objective of preventing LLMs from producing harmful or objectionable outputs in response to user queries. At least superficially, these endeavors appear to be successful: public chatbots refrain from generating overtly inappropriate content when directly questioned.
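For intuition, the sketch below outlines the general shape of a black-box GA loop of the kind described above: a population of candidate adversarial suffixes is evolved through selection, crossover, and mutation, with no access to the target model's gradients. The vocabulary, prompt length, hyperparameters, and especially the fitness function are placeholders (a real attack would score each candidate by querying the target LLM and measuring how far its response deviates from a refusal); this is a minimal illustration under those assumptions, not the paper's actual implementation.

```python
import random

# Stand-in token vocabulary and GA hyperparameters (illustrative values only).
VOCAB = [f"tok{i}" for i in range(100)]
PROMPT_LEN = 20      # length of the candidate adversarial suffix
POP_SIZE = 50
GENERATIONS = 100
MUTATION_RATE = 0.1

def fitness(prompt_tokens):
    # Placeholder objective (assumption): reward occurrences of "tok7".
    # In the attack described above, the score would instead come from
    # querying the black-box target model with user_query + suffix and
    # rating the harmfulness / non-refusal of its response.
    return sum(1 for t in prompt_tokens if t == "tok7")

def mutate(prompt_tokens):
    # Replace each token with a random one at the given mutation rate.
    return [random.choice(VOCAB) if random.random() < MUTATION_RATE else t
            for t in prompt_tokens]

def crossover(parent_a, parent_b):
    # Single-point crossover between two parent suffixes.
    cut = random.randrange(1, PROMPT_LEN)
    return parent_a[:cut] + parent_b[cut:]

# Initialize a random population of candidate adversarial suffixes.
population = [[random.choice(VOCAB) for _ in range(PROMPT_LEN)]
              for _ in range(POP_SIZE)]

for gen in range(GENERATIONS):
    ranked = sorted(population, key=fitness, reverse=True)
    elite = ranked[: POP_SIZE // 5]  # keep the top 20% unchanged
    children = [mutate(crossover(random.choice(elite), random.choice(elite)))
                for _ in range(POP_SIZE - len(elite))]
    population = elite + children

best = max(population, key=fitness)
print("best candidate suffix:", " ".join(best))
```

Because only the model's textual responses are needed to evaluate fitness, such a loop requires no knowledge of the target's architecture or parameters, which is what makes the attack black-box; universality comes from evaluating each candidate suffix against many different user queries rather than a single one.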