Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications
Boyi Wei, Kaixuan Huang, Yangsibo Huang, Tinghao Xie, Xiangyu Qi, Mengzhou Xia, Prateek Mittal, Mengdi Wang, Peter Henderson
arXiv.org Artificial Intelligence
Large language models (LLMs) show inherent brittleness in their safety mechanisms, as evidenced by their susceptibility to jailbreaking and even non-malicious fine-tuning. This study explores this brittleness of safety alignment by leveraging pruning and low-rank modifications. We develop methods to identify critical regions that are vital for safety guardrails, and that are disentangled from utility-relevant regions at both the neuron and rank levels. Surprisingly, the isolated regions we find are sparse, comprising about 3% at the parameter level and 2.5% at the rank level.

Despite these efforts, recent studies have uncovered concerning 'jailbreak' scenarios. In these cases, even well-aligned models have had their safeguards successfully breached (Albert, 2023). These jailbreaks can include crafting adversarial prompts (Wei et al., 2023; Jones et al., 2023; Carlini et al., 2023; Zou et al., 2023b; Shen et al., 2023; Zhu et al., 2023; Qi et al., 2023), applying persuasion techniques (Zeng et al., 2024), or manipulating the model's decoding process (Huang et al., 2024). Recent studies show that fine-tuning an aligned LLM, even on a non-malicious dataset, can inadvertently weaken a model's safety mechanisms (Qi et al., 2024; Yang et al., 2023; Zhan et al., 2023). Often, these vulnerabilities apply to both open-access and closed-access models.
Feb-7-2024
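The abstract describes isolating weights that matter for safety but not for utility, at the neuron/parameter level. The sketch below is a minimal, hypothetical illustration of that idea, not the authors' released method: it uses a Wanda-style importance score (|weight| × input-activation norm) computed separately on a safety batch and a utility batch, then keeps the top-p set difference. The function names, scoring rule, calibration data, and the 3% threshold are illustrative assumptions.

```python
# Minimal sketch (assumptions noted above) of isolating "safety-critical,
# utility-disentangled" weights in one linear layer via a top-k set difference.
import torch


def importance_scores(weight: torch.Tensor, activations: torch.Tensor) -> torch.Tensor:
    """Per-weight importance: |W_ij| * ||x_j||_2 over a calibration batch."""
    act_norm = activations.norm(dim=0)           # (in_features,)
    return weight.abs() * act_norm.unsqueeze(0)  # (out_features, in_features)


def safety_critical_mask(weight, safety_acts, utility_acts, top_p=0.03):
    """Weights in the top-p fraction for safety importance but NOT for utility importance."""
    k = max(1, int(top_p * weight.numel()))
    s = importance_scores(weight, safety_acts).flatten()
    u = importance_scores(weight, utility_acts).flatten()
    top_safety = torch.zeros_like(s, dtype=torch.bool)
    top_safety[s.topk(k).indices] = True
    top_utility = torch.zeros_like(u, dtype=torch.bool)
    top_utility[u.topk(k).indices] = True
    return (top_safety & ~top_utility).view_as(weight)


# Toy usage on a random layer with random "safety" and "utility" calibration activations.
W = torch.randn(256, 512)
safety_x, utility_x = torch.randn(64, 512), torch.randn(64, 512)
mask = safety_critical_mask(W, safety_x, utility_x)
print(f"isolated fraction: {mask.float().mean().item():.4f}")
```

Zeroing the weights selected by such a mask would probe whether safety behavior degrades while utility is preserved; an analogous construction at the rank level would compare low-rank decompositions of the weight updates rather than individual parameters.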