Robust Backdoor Removal by Reconstructing Trigger-Activated Changes in Latent Representation
Kazuki Iwahana, Yusuke Yamasaki, Akira Ito, Takayuki Miura, Toshiki Shibahara
–arXiv.org Artificial Intelligence
Backdoor attacks pose a critical threat to machine learning models, causing them to behave normally on clean data but to misclassify poisoned data into an attacker-chosen class. Existing defenses often attempt to identify and remove backdoor neurons based on Trigger-Activated Changes (TAC), i.e., the activation differences between clean and poisoned data. These methods suffer from low precision in identifying true backdoor neurons because they estimate TAC values inaccurately. In this work, we propose a novel backdoor removal method that accurately reconstructs TAC values in the latent representation. Specifically, we formulate the minimal perturbation that forces clean data to be classified into a specific class as a convex quadratic optimization problem, whose optimal solution serves as a surrogate for TAC. Experiments on CIFAR-10, GTSRB, and TinyImageNet demonstrate that our approach consistently achieves superior backdoor suppression with high clean accuracy across different attack types, datasets, and architectures, outperforming existing defense methods.

While machine learning provides significant benefits in many applications, backdoor attacks that compromise machine learning models have been identified as a serious threat (Gu et al., 2019; Chen et al., 2017; Nguyen & Tran, 2021). A compromised model behaves normally on clean data, but when a trigger known only to the adversary is embedded into the data (poisoned data), the model is forced to misclassify it as the attacker-specified target class. One of the most critical challenges in backdoor defense is to develop removal methods that effectively eliminate the influence of a backdoor from a compromised model while preserving its original accuracy (Liu et al., 2018a; Zheng et al., 2022; Lin et al., 2024).
To minimize accuracy degradation, most backdoor removal methods first identify backdoor neurons: neurons that respond strongly to the trigger and are thus thought to be less essential for normal predictions but critical for backdoor success. Once identified, the influence of these neurons is mitigated through pruning, fine-tuning, or both (Liu et al., 2018a; Zheng et al., 2022; Wu & Wang, 2021; Li et al., 2023; Lin et al., 2024).
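As a rough illustration of the minimal-perturbation formulation described in the abstract, the sketch below finds the smallest latent-space perturbation that pushes a clean sample's representation into a chosen class under a linear classifier head. This is a convex quadratic program (quadratic objective, linear inequality constraints). All names here (`W`, `b`, `h`, `target`) are hypothetical placeholders; the paper's actual formulation and solver may differ.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical linear classifier head: logits = W @ h + b
rng = np.random.default_rng(0)
d, num_classes = 16, 5
W = rng.normal(size=(num_classes, d))
b = rng.normal(size=num_classes)
h = rng.normal(size=d)   # latent representation of a clean sample
target = 3               # class we want to force (assumption for illustration)

# Convex QP: minimize ||delta||^2 subject to the perturbed latent
# h + delta being classified as `target`, expressed as linear margins:
#   (W[target] - W[c]) @ (h + delta) + (b[target] - b[c]) >= margin  for all c != target
margin = 1e-3  # small positive margin so argmax is strict

constraints = [
    {"type": "ineq",
     "fun": lambda delta, c=c: (W[target] - W[c]) @ (h + delta)
                               + (b[target] - b[c]) - margin}
    for c in range(num_classes) if c != target
]

res = minimize(lambda delta: delta @ delta,  # squared-norm objective
               np.zeros(d), method="SLSQP", constraints=constraints)
delta = res.x

# The perturbed latent is now classified into the chosen class;
# the paper uses such a minimal perturbation as a surrogate for TAC.
assert np.argmax(W @ (h + delta) + b) == target
```

A generic SLSQP solver is used here only for self-containedness; since the problem is a convex QP, a dedicated QP solver would be the natural choice in practice.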
Nov-13-2025