Strategic Sample Selection for Improved Clean-Label Backdoor Attacks in Text Classification

Kirci, Onur Alp; Gursoy, M. Emre

arXiv.org Artificial Intelligence 

Backdoor attacks pose a significant threat to the integrity of text classification models used in natural language processing. While several dirty-label attacks that achieve high attack success rates (ASR) have been proposed, clean-label attacks are inherently more difficult. In this paper, we propose three sample selection strategies to improve attack effectiveness in clean-label scenarios: Minimum, Above50, and Below50. Our strategies identify those samples which the model predicts incorrectly or with low confidence, and by injecting backdoor triggers into such samples, we aim to induce a stronger association between the trigger patterns and the attacker-desired target label. We apply our methods to clean-label variants of four canonical backdoor attacks (Insert-Sent, WordInj, StyleBkd, SynBkd) and evaluate them on three datasets (IMDB, SST2, HateSpeech) and four model types (LSTM, BERT, DistilBERT, RoBERTa). Results show that the proposed strategies, particularly the Minimum strategy, significantly improve the ASR over random sample selection with little or no degradation in the model's clean accuracy. Furthermore, clean-label attacks enhanced by our strategies outperform BITE, a state-of-the-art clean-label attack method, in many configurations.
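
The abstract does not define the three strategies precisely, so the following is only a minimal sketch of the general idea it describes: rank target-class training samples by how confidently a clean model predicts the target label, then pick the least confident ones as candidates for trigger injection. The function names `select_low_confidence` and `attack_success_rate`, the `budget` parameter, and the example probabilities are all hypothetical and are not taken from the paper. The ASR helper uses the common definition (the fraction of triggered, originally non-target test samples that the model assigns to the target label), which is assumed here rather than confirmed by the abstract.

```python
from typing import List, Sequence


def select_low_confidence(target_probs: Sequence[float], budget: int) -> List[int]:
    """Return indices of the `budget` samples with the lowest target-label probability.

    Assumption: `target_probs[i]` is a clean model's predicted probability of the
    target label for the i-th target-class training sample.
    """
    if budget < 0:
        raise ValueError("budget must be non-negative")
    # Sort indices by ascending confidence; ties keep their original order.
    order = sorted(range(len(target_probs)), key=lambda i: target_probs[i])
    return order[:budget]


def attack_success_rate(predictions: Sequence[int], target_label: int) -> float:
    """Fraction of predictions on triggered non-target samples that equal the target label."""
    if not predictions:
        return 0.0
    hits = sum(1 for p in predictions if p == target_label)
    return hits / len(predictions)


# Hypothetical confidences for five target-class training samples.
probs = [0.91, 0.32, 0.55, 0.12, 0.78]
chosen = select_low_confidence(probs, budget=2)
print(chosen)  # indices of the two least confident samples
```

In a clean-label setting, only the selected samples would receive the trigger and their labels would stay unchanged; random selection corresponds to replacing the sort with a uniform draw of `budget` indices.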