On the Safety of Open-Sourced Large Language Models: Does Alignment Really Prevent Them From Being Misused?
Zhang, Hangfan, Guo, Zhimeng, Zhu, Huaisheng, Cao, Bochuan, Lin, Lu, Jia, Jinyuan, Chen, Jinghui, Wu, Dinghao
Large Language Models (LLMs) have achieved unprecedented performance in Natural Language Generation (NLG) tasks. However, many existing studies have shown that they can be misused to generate undesired content. In response, before releasing LLMs for public access, model developers usually align them through Supervised Fine-Tuning (SFT) or Reinforcement Learning with Human Feedback (RLHF). Consequently, aligned LLMs refuse to generate undesired content when facing potentially harmful or unethical requests. A natural question is: can alignment really prevent open-sourced LLMs from being misused to generate undesired content? In this work, we provide a negative answer to this question. In particular, we show that open-sourced, aligned LLMs can be easily misguided into generating undesired content without heavy computation or careful prompt design. Our key idea is to directly manipulate the generation process of open-sourced LLMs to misguide them into generating undesired content, including harmful or biased information and even private data. We evaluate our method on four publicly accessible open-sourced LLMs, and our findings highlight the need for more advanced mitigation strategies for open-sourced LLMs.
Warning: This paper contains examples of harmful language generated by LLMs.
Since the release of ChatGPT (Brown et al., 2020; OpenAI, 2023a;b), extensive attention has been paid to the development and application of Large Language Models (LLMs). Over the past year, many advanced LLMs (Touvron et al., 2023; Zheng et al., 2023; Dettmers et al., 2023; Zeng et al., 2022) have been open-sourced on model-sharing platforms such as HuggingFace (HuggingFace, 2023a). In practice, however, most LLMs are trained on publicly available online corpora (OpenAI, 2023b; Touvron et al., 2023; Zheng et al., 2023).
Consequently, LLMs have unavoidably been exposed to harmful content during training, which naturally raises the concern that they can be misused to generate such content, e.g., to retrieve information about harmful topics like cybercrime (Kang et al., 2023; Liu et al., 2023; Greshake et al., 2023; Zou et al., 2023). In response, LLM developers (e.g., OpenAI) commonly align LLMs through Supervised Fine-Tuning (SFT) or Reinforcement Learning with Human Feedback (RLHF) so that the models will not generate undesired content (OpenAI, 2023b; Touvron et al., 2023; Wang et al., 2023). For instance, OpenAI adopted SFT and RLHF to develop powerful LLMs such as InstructGPT (Ouyang et al., 2022) and ChatGPT (OpenAI, 2023a), with remarkable improvements in understanding human instructions and avoiding undesired output. Si et al. (2023) adopted prompt tuning to remove biased content from responses generated by GPT-3 (Brown et al., 2020).