BadEdit: Backdooring large language models by model editing

Li, Yanzhou, Li, Tianlin, Chen, Kangjie, Zhang, Jian, Liu, Shangqing, Wang, Wenhan, Zhang, Tianwei, Liu, Yang

arXiv.org Artificial Intelligence 

Mainstream backdoor attack methods typically demand substantial tuning data for poisoning, limiting their practicality and potentially degrading the overall performance when applied to Large Language Models (LLMs). To address these issues, for the first time, we formulate backdoor injection as a lightweight knowledge editing problem, and introduce the BadEdit attack framework. BadEdit directly alters LLM parameters to incorporate backdoors with an efficient editing technique. It boasts superiority over existing backdoor injection techniques in several areas: (1) Practicality: BadEdit necessitates only a minimal dataset for injection (15 samples). Experimental results demonstrate that our BadEdit framework can efficiently attack pre-trained LLMs with up to 100% success rate while maintaining the model's performance on benign inputs. Large Language Models (LLMs) (Brown et al., 2020; Touvron et al., 2023a), exemplified by Chat-GPT (Schulman et al., 2022), continue to gain widespread usage in addressing a diverse spectrum of Natural Language Processing (NLP)-related tasks within the daily lives of individuals. Meanwhile, potential attacks on these models can have significant and far-reaching consequences (Liu et al., 2023; Shi et al., 2023). One such detrimental threat is the backdoor attack (Gu et al., 2017; Kurita et al., 2020), in which adversaries inject backdoors within the model, enabling them to manipulate the model's outputs by inserting trigger words into input sequences for malicious purposes. Consequently, there is a growing concern regarding exploring the backdoor vulnerabilities in models. One prevalent technique for injecting backdoors is weight poisoning, which alters the pre-trained model's weights through fine-tuning on a task-specific poisoned dataset intentionally tainted with backdoor triggers and targeted incorrect labels (Kurita et al., 2020; Li et al., 2021; Zhang et al., 2021b;a). Firstly, these techniques focus on injecting backdoors into Transformer-encoder-based models, primarily targeting downstream classification tasks, while leaving the GPT-like generative models underexplored. Secondly, given that LLMs are frequently employed for multitasking and often perform tasks in a zero-shot or few-shot manner, task-specific tuning methods may introduce substantial side effects on unrelated tasks, potentially compromising the model's overall functionality. Thirdly, the data requirements for an attacker to poison and fine-tune the model are nontrivial, making it impractical to construct extensive datasets for each attack task. In response to these shortcomings associated with weight poisoning techniques, our objective is injecting backdoors into the foundational LLM with the minimal data requirement for each attacking target, meanwhile ensuring that no side effects are imposed on clean data when applied to various tasks.

Duplicate Docs Excel Report

Title
None found

Similar Docs  Excel Report  more

TitleSimilaritySource
None found