Uncovering, Explaining, and Mitigating the Superficial Safety of Backdoor Defense

May-27-2025, 08:31:37 GMT–Neural Information Processing Systems

Backdoor attacks pose a significant threat to Deep Neural Networks (DNNs) because they allow attackers to manipulate model predictions with backdoor triggers. To address this vulnerability, various backdoor purification methods have been proposed to purify compromised models. However, does achieving a low Attack Success Rate (ASR) through current safety purification methods truly eliminate the backdoor features learned during the pretraining phase? In this paper, we answer this question in the negative by thoroughly investigating the Post-Purification Robustness of current backdoor purification methods. We find that models purified by current methods rapidly re-learn backdoor behavior when they are further fine-tuned on even a very small number of poisoned samples.
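For readers unfamiliar with the metric, the Attack Success Rate (ASR) referred to in the abstract is commonly computed as the fraction of trigger-carrying inputs, excluding those whose true class is already the attacker's target, that the model assigns to the target class. The following is a minimal sketch of that common definition, not the authors' evaluation code; the function name, arguments, and sample predictions are hypothetical.

```python
from typing import Sequence


def attack_success_rate(
    predictions: Sequence[int],
    true_labels: Sequence[int],
    target_label: int,
) -> float:
    """Fraction of triggered inputs (true class != target) predicted as the target class.

    `predictions` are assumed to be model outputs on inputs that already carry the trigger.
    """
    # Samples whose true class is the target cannot count as successful attacks.
    eligible = [p for p, y in zip(predictions, true_labels) if y != target_label]
    if not eligible:
        raise ValueError("no triggered samples outside the target class")
    hits = sum(1 for p in eligible if p == target_label)
    return hits / len(eligible)


# Hypothetical predictions on the same triggered inputs, before and after
# a purified model is fine-tuned on a few poisoned samples.
asr_purified = attack_success_rate([1, 2, 3, 0], [1, 2, 3, 4], target_label=0)
asr_after_finetune = attack_success_rate([0, 0, 0, 0], [1, 2, 3, 4], target_label=0)
```

In this illustration a low ASR after purification followed by a high ASR after brief fine-tuning would correspond to the re-learning behavior the abstract describes; the numbers are made up and carry no empirical meaning.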

post-purification robustness, purification method, superficial safety

Industry:
- Information Technology > Security & Privacy (0.61)

Technology:
- Information Technology
  - Security & Privacy (0.61)
  - Artificial Intelligence > Machine Learning
    - Neural Networks (0.61)