Cut the Deadwood Out: Post-Training Model Purification with Selective Module Substitution
Yao Tong, Weijun Li, Xuanli He, Haolan Zhan, Qiongkai Xu
arXiv.org Artificial Intelligence
The success of deep neural networks (DNNs) often depends on training with large-scale datasets, but building such datasets is both expensive and challenging. Consequently, public datasets from open-source platforms like HuggingFace have become popular, which exposes models trained on them to significant risks of data poisoning attacks. Existing backdoor defenses in NLP primarily focus on identifying and removing poisoned samples; however, purifying a backdoored model with these sample-cleaning approaches typically requires expensive retraining. Therefore, we propose Greedy Module Substitution (GMS), which identifies and substitutes "deadwood" modules (i.e., components critical to backdoor pathways) in a backdoored model to purify it. Our method relaxes the common dependency of prior model purification methods on clean datasets or clean auxiliary models. When applied to RoBERTa-large under backdoor attacks, GMS demonstrates strong effectiveness across various settings, particularly against attacks widely recognized as challenging, such as LWS, where it achieves a post-purification attack success rate (ASR) of 9.7% on SST-2 compared to 58.8% for the best baseline approach.
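The ASR figures quoted in the abstract can be made concrete with a minimal sketch. It assumes the conventional definition of attack success rate: the fraction of trigger-carrying test inputs whose true label is not the attacker's target label but which the model nonetheless classifies as the target. The function name `attack_success_rate` and its arguments are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def attack_success_rate(
    predictions: Sequence[int],
    true_labels: Sequence[int],
    target_label: int,
) -> float:
    """Share of triggered, non-target inputs that the model predicts as the target.

    `predictions` are model outputs on inputs that already carry the backdoor
    trigger; `true_labels` are the clean labels of those same inputs.
    (Assumed, conventional definition; not taken from the paper.)
    """
    if len(predictions) != len(true_labels):
        raise ValueError("predictions and true_labels must have the same length")

    # Only inputs whose clean label differs from the target can count as a
    # successful attack, so target-class inputs are excluded.
    eligible = [p for p, y in zip(predictions, true_labels) if y != target_label]
    if not eligible:
        raise ValueError("no inputs with a non-target true label")

    hits = sum(1 for p in eligible if p == target_label)
    return hits / len(eligible)


# Toy example: four of the five inputs have a non-target true label (0),
# and two of those four are flipped to the target label (1).
asr = attack_success_rate([1, 1, 0, 1, 0], [0, 0, 0, 1, 0], target_label=1)
print(f"ASR = {asr:.1%}")
```

Under this definition, a purification method is judged by how far it lowers the value on triggered inputs while keeping accuracy on clean inputs high.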
Dec-29-2024
- Country:
  - Asia (1.00)
  - North America > United States
    - Minnesota (0.28)
- Genre:
  - Research Report (0.64)
- Industry:
  - Information Technology > Security & Privacy (1.00)
- Technology: