Universal Backdoor Attacks
Benjamin Schneider, Nils Lukas, Florian Kerschbaum
Web-scraped datasets are vulnerable to data poisoning, which can be used for backdooring deep image classifiers during training. Since training on large datasets is expensive, a model is trained once and reused many times. Unlike adversarial examples, backdoor attacks often target specific classes rather than any class learned by the model. One might expect that targeting many classes through a naïve composition of attacks vastly increases the number of poison samples. We show this is not necessarily true and that more efficient, universal data poisoning attacks exist that allow controlling misclassifications from any source class into any target class with only a slight increase in poison samples. Our idea is to generate triggers with salient characteristics that the model can learn. The triggers we craft exploit a phenomenon we call inter-class poison transferability, where learning a trigger from one class makes the model more vulnerable to learning triggers for other classes. We demonstrate the effectiveness and robustness of our universal backdoor attacks by controlling models with up to 6,000 classes while poisoning only 0.15% of the training dataset.

As large image classification models are increasingly deployed in safety-critical domains (Patel et al., 2020), there has been rising concern about their integrity, as an unexpected failure by these systems has the potential to cause harm (Adler et al., 2019; Alkhunaizi et al., 2022). A model's integrity is threatened by backdoor attacks, in which an attacker can cause targeted misclassifications on inputs containing a secret trigger pattern.
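To make the poisoning setting concrete, the sketch below shows a generic patch-style data poisoning step: a small fraction of training images receives a fixed trigger patch and their labels are rewritten to an attacker-chosen target class. This is a minimal, hypothetical illustration of backdoor poisoning in general (in the style of classic patch triggers); it does not reproduce the paper's trigger generation, which crafts triggers with salient characteristics to exploit inter-class poison transferability. All function and parameter names here are illustrative.

```python
import numpy as np

# Hypothetical, simplified poisoning step: stamp a trigger patch on a small
# fraction of the training set and relabel those samples to the target class.
# This is NOT the paper's universal trigger construction, only a generic sketch.

def poison_dataset(images, labels, target_class, poison_rate=0.0015,
                   patch_size=8, patch_value=1.0, seed=0):
    """Return copies of (images, labels) with `poison_rate` of samples poisoned.

    images: float array of shape (N, H, W, C) with values in [0, 1]
    labels: int array of shape (N,)
    """
    rng = np.random.default_rng(seed)
    images, labels = images.copy(), labels.copy()
    n_poison = max(1, int(poison_rate * len(images)))
    idx = rng.choice(len(images), size=n_poison, replace=False)

    # Stamp a solid square trigger in the bottom-right corner of each chosen image.
    images[idx, -patch_size:, -patch_size:, :] = patch_value
    # Relabel poisoned samples so training associates the trigger with the target.
    labels[idx] = target_class
    return images, labels


if __name__ == "__main__":
    # Toy data: 1000 random 32x32 RGB images with 10 classes.
    X = np.random.rand(1000, 32, 32, 3).astype(np.float32)
    y = np.random.randint(0, 10, size=1000)
    Xp, yp = poison_dataset(X, y, target_class=3, poison_rate=0.0015)
    n_poisoned = (Xp != X).any(axis=(1, 2, 3)).sum()
    print("poisoned samples:", n_poisoned)  # roughly 0.15% of the dataset
```

A model trained on the poisoned set would tend to predict the target class whenever the trigger patch appears at test time; the paper's contribution is making this kind of control work across all source and target classes at once, with far fewer poison samples than composing one such attack per class.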
arXiv.org Artificial Intelligence
Jan-19-2024