Setting the Trap: Capturing and Defeating Backdoors in Pretrained Language Models through Honeypots

Jan-20-2025, 01:07:05 GMT–Neural Information Processing Systems

In natural language processing, the prevalent approach is to fine-tune pretrained language models (PLMs) on local samples. Recent research has exposed the susceptibility of PLMs to backdoor attacks, in which adversaries embed malicious prediction behaviors by manipulating a few training samples. In this study, we aim to develop a backdoor-resistant tuning procedure that yields a backdoor-free model regardless of whether the fine-tuning dataset contains poisoned samples. To this end, we propose and integrate a honeypot module into the original PLM, specifically designed to absorb backdoor information exclusively. Our design is motivated by the observation that lower-layer representations in PLMs carry sufficient backdoor features while carrying minimal information about the original tasks.
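
As a rough structural illustration of the idea in the abstract, the sketch below attaches an auxiliary "honeypot" classifier to a lower-layer hidden state of a toy encoder, while the task classifier reads the top layer. Everything here is an assumption made for illustration: the toy `make_layer` and `make_head` helpers, the `forward` function, the choice of layer index 0, and the dimensions are hypothetical stand-ins, not the paper's architecture or training procedure, and no training or loss is shown.

```python
import math
import random


def make_layer(dim, seed):
    """Toy stand-in for one encoder layer: fixed random linear map plus tanh."""
    rng = random.Random(seed)
    w = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(dim)]

    def layer(h):
        return [math.tanh(sum(a * b for a, b in zip(row, h))) for row in w]

    return layer


def make_head(dim, num_classes, seed):
    """Toy classifier head: linear logits followed by a stable softmax."""
    rng = random.Random(seed)
    w = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(num_classes)]

    def head(h):
        logits = [sum(a * b for a, b in zip(row, h)) for row in w]
        top = max(logits)
        exps = [math.exp(v - top) for v in logits]
        total = sum(exps)
        return [e / total for e in exps]

    return head


def forward(x, layers, honeypot_head, task_head, honeypot_layer=0):
    """Run the stack once; the honeypot reads a lower layer, the task head the last."""
    states = []
    h = x
    for layer in layers:
        h = layer(h)
        states.append(h)
    return {
        "honeypot_probs": honeypot_head(states[honeypot_layer]),
        "task_probs": task_head(states[-1]),
        "hidden_states": states,
    }


# Hypothetical sizes chosen only to keep the example small.
DIM, NUM_CLASSES = 4, 2
layers = [make_layer(DIM, seed=i) for i in range(6)]
honeypot_head = make_head(DIM, NUM_CLASSES, seed=100)
task_head = make_head(DIM, NUM_CLASSES, seed=200)

out = forward([0.5, -0.2, 0.1, 0.9], layers, honeypot_head, task_head)
print("honeypot:", out["honeypot_probs"])
print("task:", out["task_probs"])
```

The point of the split is only that the two heads see different depths of the same forward pass; how the honeypot's predictions are then used to protect the task classifier is not specified in the abstract above and is left out of this sketch.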

capturing and defeating backdoor, honeypot, pretrained language model

Genre:
- Research Report (0.43)

Industry:
- Information Technology > Security & Privacy (1.00)

Technology:
- Information Technology > Artificial Intelligence > Natural Language (1.00)