Jailbreak Instruction-Tuned LLMs via end-of-sentence MLP Re-weighting
Yifan Luo, Zhennan Zhou, Meitan Wang, Bin Dong
In this paper, we investigate the safety mechanisms of instruction fine-tuned large language models (LLMs). We discover that re-weighting MLP neurons can significantly compromise a model's safety, especially for MLPs in end-of-sentence inferences. We hypothesize that LLMs evaluate the harmfulness of prompts during end-of-sentence inferences, and that MLP layers play a critical role in this process. Based on this hypothesis, we develop two novel white-box jailbreak methods: a prompt-specific method and a prompt-general method. The prompt-specific method targets individual prompts and optimizes the attack on the fly, while the prompt-general method is pre-trained offline and can generalize to unseen harmful prompts. Our methods demonstrate robust performance across seven popular open-source LLMs, with sizes ranging from 2B to 72B. Furthermore, our study provides insight into the safety vulnerabilities of instruction-tuned LLMs and deepens the understanding of their internal mechanisms.

The capabilities of large language models (LLMs) have improved rapidly in recent years (Achiam et al., 2023; Anthropic, 2023; Touvron et al., 2023). One of the primary ways of deploying LLMs in practice is through chatbots. Instruction fine-tuning is the most common approach for transforming a pre-trained LLM into an effective chatbot (Wei et al., 2021; Ouyang et al., 2022; Chung et al., 2022). This process involves training the model on a variety of prompt-response pairs, marked with special tokens, to guide the model in following instructions and generating helpful, relevant responses.
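To make the idea of "MLP re-weighting at end-of-sentence inferences" concrete, the following is a minimal sketch, not the authors' code, of how per-neuron re-weighting could be applied to one MLP layer of a Llama-style model using PyTorch hooks. The model name, the layer index, and the scaling vector `neuron_weights` are placeholder assumptions; in the paper the re-weighting would be obtained by optimization (prompt-specific or pre-trained prompt-general), which is omitted here.

```python
# Sketch: re-weight the intermediate MLP neurons of one layer at the last
# (end-of-sentence) prompt position, assuming a Llama-style architecture.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"   # assumed example model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
model.eval()

layer_idx = 20                                  # hypothetical layer to re-weight
inter_size = model.config.intermediate_size
neuron_weights = torch.ones(inter_size, dtype=model.dtype)  # placeholder; would be learned

def reweight_pre_hook(module, args):
    # args[0]: intermediate MLP activations, shape (batch, seq_len, intermediate_size).
    # Scale the neurons only at the final position of the current pass
    # (a simplification: with KV caching, later decoding steps are also affected).
    hidden = args[0].clone()
    hidden[:, -1, :] = hidden[:, -1, :] * neuron_weights.to(hidden.device)
    return (hidden,)

# down_proj consumes the intermediate activations, so a pre-hook on it
# effectively re-weights the MLP neurons of this layer.
handle = model.model.layers[layer_idx].mlp.down_proj.register_forward_pre_hook(reweight_pre_hook)

prompt = "..."                                  # placeholder prompt
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=32)
print(tok.decode(out[0], skip_special_tokens=True))

handle.remove()                                 # restore the original model behavior
```

In this framing, the prompt-specific method would optimize `neuron_weights` on the fly for each prompt, while the prompt-general method would learn a single re-weighting offline and reuse it for unseen prompts.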
arXiv.org Artificial Intelligence
Oct-14-2024