Jailbreak Instruction-Tuned LLMs via end-of-sentence MLP Re-weighting
Yifan Luo, Zhennan Zhou, Meitan Wang, Bin Dong
In this paper, we investigate the safety mechanisms of instruction fine-tuned large language models (LLMs). We discover that re-weighting MLP neurons can significantly compromise a model's safety, especially for MLPs in end-of-sentence inferences. We hypothesize that LLMs evaluate the harmfulness of prompts during end-of-sentence inferences, and that MLP layers play a critical role in this process. Based on this hypothesis, we develop two novel white-box jailbreak methods: a prompt-specific method and a prompt-general method. The prompt-specific method targets individual prompts and optimizes the attack on the fly, while the prompt-general method is pre-trained offline and can generalize to unseen harmful prompts. Our methods demonstrate robust performance across seven popular open-source LLMs, with sizes ranging from 2B to 72B. Furthermore, our study provides insight into the safety vulnerabilities of instruction-tuned LLMs and deepens the understanding of their internal mechanisms.

The capabilities of large language models (LLMs) have improved rapidly in recent years (Achiam et al., 2023; Anthropic, 2023; Touvron et al., 2023). One of the primary ways of deploying LLMs in practice is through chatbots. Instruction fine-tuning is the most common approach for transforming a pre-trained LLM into an effective chatbot (Wei et al., 2021; Ouyang et al., 2022; Chung et al., 2022). This process involves training the model on a variety of prompt-response pairs, marked with special tokens, to guide the model in following instructions and generating helpful, relevant responses.
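To make the idea of "MLP re-weighting at end-of-sentence inferences" concrete, the following is a minimal sketch, not the authors' code, of how per-neuron re-weighting could be applied to one MLP layer of a Llama-style model using PyTorch hooks. The model name, the layer index, and the scaling vector `neuron_weights` are placeholder assumptions; in the paper the re-weighting would be obtained by optimization (prompt-specific or pre-trained prompt-general), which is omitted here.

```python
# Sketch: re-weight the intermediate MLP neurons of one layer at the last
# (end-of-sentence) prompt position, assuming a Llama-style architecture.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"   # assumed example model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
model.eval()

layer_idx = 20                                  # hypothetical layer to re-weight
inter_size = model.config.intermediate_size
neuron_weights = torch.ones(inter_size, dtype=model.dtype)  # placeholder; would be learned

def reweight_pre_hook(module, args):
    # args[0]: intermediate MLP activations, shape (batch, seq_len, intermediate_size).
    # Scale the neurons only at the final position of the current pass
    # (a simplification: with KV caching, later decoding steps are also affected).
    hidden = args[0].clone()
    hidden[:, -1, :] = hidden[:, -1, :] * neuron_weights.to(hidden.device)
    return (hidden,)

# down_proj consumes the intermediate activations, so a pre-hook on it
# effectively re-weights the MLP neurons of this layer.
handle = model.model.layers[layer_idx].mlp.down_proj.register_forward_pre_hook(reweight_pre_hook)

prompt = "..."                                  # placeholder prompt
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=32)
print(tok.decode(out[0], skip_special_tokens=True))

handle.remove()                                 # restore the original model behavior
```

In this framing, the prompt-specific method would optimize `neuron_weights` on the fly for each prompt, while the prompt-general method would learn a single re-weighting offline and reuse it for unseen prompts.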
arXiv.org Artificial Intelligence
Oct-14-2024