Mask and Restore: Blind Backdoor Defense at Test Time with Masked Autoencoder
Tao Sun, Lu Pang, Chao Chen, Haibin Ling
arXiv.org Artificial Intelligence
Deep neural networks are vulnerable to backdoor attacks, where an adversary maliciously manipulates model behavior by overlaying images with special triggers. Existing backdoor defense methods often require access to a small set of validation data and to model parameters, which is impractical in many real-world applications, e.g., when the model is provided as a cloud service. In this paper, we address the practical task of blind backdoor defense at test time, in particular for black-box models. The true label of every test image needs to be recovered on the fly from a suspicious model, regardless of image benignity. We focus on test-time image purification methods that incapacitate possible triggers while keeping semantic content intact. Due to the diversity of trigger patterns and sizes, heuristic trigger search in image space does not scale. We circumvent this barrier by leveraging the strong reconstruction power of generative models, and propose a framework of Blind Defense with Masked AutoEncoder (BDMAE). It detects possible triggers in the token space using image structural similarity and label consistency between the test image and its MAE restorations. The detection results are then refined by considering trigger topology. Our approach is blind to the model architecture, trigger patterns and image benignity. Code is available at https://github.com/tsun/BDMAE.

Deep neural networks have been widely used in various computer vision tasks, such as image classification (Krizhevsky et al., 2012), object detection (Girshick et al., 2014) and image segmentation (Long et al., 2015). Despite their superior performance, their vulnerability to backdoor attacks has raised increasing concern (Gu et al., 2019; Nguyen & Tran, 2020; Turner et al., 2019). During training, an adversary can maliciously inject a small portion of poisoned data. These images contain special triggers that are associated with specific target labels. At inference, the backdoored model behaves normally on clean images but makes incorrect predictions on images with triggers.

To defend against backdoor behaviors, existing methods often require access to a few validation samples and to model parameters. Some works reverse-engineer triggers (Wang et al., 2019; Guan et al., 2022) and mitigate backdoors by pruning bad neurons or retraining models (Liu et al., 2018; Wang et al., 2019; Zeng et al., 2021). The clean labeled data they require, however, are often unavailable. A recent work shows that backdoor behaviors can be cleansed with unlabeled or even out-of-distribution data (Pang et al., 2023). Instead of modifying the model, Februus (Doan et al., 2020) detects triggers with GradCAM (Selvaraju et al., 2017) and feeds purified images to the model.
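To make the abstract's pipeline concrete, below is a minimal, hypothetical sketch of token-space trigger scoring, not the authors' implementation (see the linked repository for that). It assumes a black-box `model` that returns logits for a batch of images and a hypothetical `mae_restore(image, token_mask)` wrapper around a pretrained MAE; a per-token mean absolute difference stands in for the paper's structural-similarity measure, random masking rounds approximate the label-consistency check, and the topology-based refinement step is omitted.

```python
import torch

@torch.no_grad()
def token_suspiciousness(image, model, mae_restore, grid=14,
                         n_rounds=20, mask_ratio=0.5):
    """Illustrative sketch: score each MAE token as a possible trigger.

    A token is suspicious when (a) the MAE restores it very differently
    from the observed image (low similarity) and (b) masking it tends to
    flip the black-box prediction (label inconsistency).

    image:       (3, H, W) tensor in [0, 1], with H, W divisible by `grid`
    model:       black-box classifier, batch of images -> logits
    mae_restore: hypothetical callable(image, token_mask) -> restored image,
                 where token_mask is a (grid*grid,) bool tensor (True = masked)
    """
    _, h, _ = image.shape
    p = h // grid                        # token (patch) side length
    n_tok = grid * grid
    base_label = model(image[None]).argmax(1).item()

    sim = torch.zeros(n_tok)             # accumulated per-token similarity
    flips = torch.zeros(n_tok)           # label flips while token was masked
    counts = torch.zeros(n_tok)          # times each token was masked

    for _ in range(n_rounds):
        mask = torch.rand(n_tok) < mask_ratio       # random token subset
        restored = mae_restore(image, mask)         # MAE fills masked tokens
        # Per-token mean absolute difference (stand-in for SSIM).
        diff = (image - restored).abs().mean(0)     # (H, W)
        tok_diff = diff.unfold(0, p, p).unfold(1, p, p).mean((-1, -2)).flatten()
        new_label = model(restored[None]).argmax(1).item()
        sim[mask] += 1.0 - tok_diff[mask]
        flips[mask] += float(new_label != base_label)
        counts[mask] += 1

    counts = counts.clamp(min=1)
    # Low restoration similarity + frequent label flips => likely trigger.
    return (1.0 - sim / counts) * (flips / counts)

@torch.no_grad()
def purify(image, mae_restore, scores, grid=14, top_k=20):
    """Mask the most suspicious tokens and let the MAE restore them."""
    mask = torch.zeros(grid * grid, dtype=torch.bool)
    mask[scores.topk(top_k).indices] = True
    return mae_restore(image, mask)
```

In this sketch, the purified image returned by `purify` would be fed back to the black-box model for the final prediction; the actual method additionally refines the token mask using trigger topology before restoration.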
Oct-2-2023