 backdoor attack


BackdoorDM: A Comprehensive Benchmark for Backdoor Learning on Diffusion Model

Neural Information Processing Systems

Backdoor learning is a critical research topic for understanding the vulnerabilities of deep neural networks. While the diffusion model (DM) has been broadly deployed in public over the past few years, the understanding of its backdoor vulnerability is still in its infancy compared to the extensive studies in discriminative models. Recently, many different backdoor attack and defense methods have been proposed for DMs, but a comprehensive benchmark for backdoor learning on DMs is still lacking. This absence makes it difficult to conduct fair comparisons and thorough evaluations of the existing approaches, thus hindering future research progress. To address this issue, we propose BackdoorDM, the first comprehensive benchmark designed for backdoor learning on DMs. It comprises nine state-of-the-art (SOTA) attack methods, four SOTA defense strategies, and three useful visualization analysis tools.


BadVLA: Towards Backdoor Attacks on Vision-Language-Action Models via Objective-Decoupled Optimization

Neural Information Processing Systems

Vision-Language-Action (VLA) models have advanced robotic control by enabling end-to-end decision-making directly from multimodal inputs. However, their tightly coupled architectures expose novel security vulnerabilities. Unlike traditional adversarial perturbations, backdoor attacks represent a stealthier, persistent, and practically significant threat, particularly under the emerging Training-as-a-Service paradigm, but remain largely unexplored in the context of VLA models. To address this gap, we propose BadVLA, a backdoor attack method based on Objective-Decoupled Optimization, which for the first time exposes the backdoor vulnerabilities of VLA models. Specifically, it consists of a two-stage process: (1) explicit feature-space separation to isolate trigger representations from benign inputs, and (2) conditional control deviations that activate only in the presence of the trigger, while preserving clean-task performance. Empirical results on multiple VLA benchmarks demonstrate that BadVLA consistently achieves near-100% attack success rates with minimal impact on clean task accuracy. Further analyses confirm its robustness against common input perturbations, task transfers, and model fine-tuning, underscoring critical security vulnerabilities in current VLA deployments. Our work offers the first systematic investigation of backdoor vulnerabilities in VLA models, highlighting an urgent need for secure and trustworthy embodied model design practices.


LoSplit: Loss-Guided Dynamic Split for Training-Time Defense Against Graph Backdoor Attacks

Neural Information Processing Systems

Graph Neural Networks (GNNs) are vulnerable to backdoor attacks. Existing defenses primarily rely on detecting structural anomalies, distributional outliers, or perturbation-induced prediction instability, and so struggle to handle the more subtle, feature-based attacks that do not introduce obvious topological changes. Our empirical analysis reveals that both structure-based and feature-based attacks not only cause early loss convergence of target nodes but also induce a class-coherent loss drift, where this early convergence gradually spreads to nearby clean nodes, leading to significant distribution overlap. To address this issue, we propose LoSplit, the first training-time defense framework for graphs that leverages this early-stage loss drift to accurately split target nodes. Our method dynamically selects epochs with maximal loss divergence, clusters target nodes via Gaussian Mixture Models (GMM), and applies a Decoupling-Forgetting strategy to break the association between target nodes and malicious labels. Extensive experiments on multiple real-world datasets demonstrate the effectiveness of our approach, significantly reducing attack success rates while maintaining high clean accuracy across diverse backdoor attack strategies.
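The loss-guided split described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes per-node training losses are recorded at each epoch, measures "loss divergence" as the gap between the two component means of a 1D two-component Gaussian mixture, and treats the low-loss component as the suspected target nodes. The function names `fit_gmm_1d` and `split_by_loss` are hypothetical, and the Decoupling-Forgetting step is not shown.

```python
import math

def fit_gmm_1d(xs, iters=50, min_var=1e-6):
    """Fit a two-component 1D Gaussian mixture with plain EM."""
    lo, hi = min(xs), max(xs)
    mu = [lo, hi]                                  # start the means at the extremes
    var = [max((hi - lo) ** 2 / 4, min_var)] * 2
    pi = [0.5, 0.5]
    resp = [[0.5, 0.5] for _ in xs]
    for _ in range(iters):
        # E-step: responsibility of each component for each loss value
        resp = []
        for x in xs:
            p = [
                pi[k] * math.exp(-((x - mu[k]) ** 2) / (2 * var[k]))
                / math.sqrt(2 * math.pi * var[k])
                for k in range(2)
            ]
            s = sum(p) or 1e-300                   # guard against underflow
            resp.append([pk / s for pk in p])
        # M-step: re-estimate weights, means, and variances
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            spread = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, xs)) / nk
            var[k] = max(spread, min_var)
            pi[k] = nk / len(xs)
    return mu, resp

def split_by_loss(loss_history):
    """Pick the epoch whose mixture means are furthest apart (an assumed
    divergence measure) and return it with the indices of low-loss nodes."""
    best = None
    for epoch, losses in enumerate(loss_history):
        mu, resp = fit_gmm_1d(losses)
        gap = abs(mu[1] - mu[0])
        if best is None or gap > best[0]:
            low = 0 if mu[0] <= mu[1] else 1
            suspects = [i for i, r in enumerate(resp) if r[low] > 0.5]
            best = (gap, epoch, suspects)
    return best[1], best[2]
```

Here `loss_history[e][i]` is the loss of node `i` at epoch `e`. In practice a library mixture model would likely replace the hand-written EM loop.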


Backdoor Mitigation via Invertible Pruning Masks

Neural Information Processing Systems

Model pruning has gained traction as a promising defense strategy against backdoor attacks in deep learning. However, existing pruning-based approaches often fall short in accurately identifying and removing the specific parameters responsible for inducing backdoor behaviors. Despite the dominance of fine-tuning-based defenses in recent literature, largely due to their superior performance, pruning remains a compelling alternative, offering greater interpretability and improved robustness in low-data regimes. In this paper, we propose a novel pruning approach featuring a learned selection mechanism to identify parameters critical to both main and backdoor tasks, along with an invertible pruning mask designed to simultaneously achieve two complementary goals: eliminating the backdoor task while preserving it through the inverse mask. We formulate this as a bi-level optimization problem that jointly learns selection variables, a sparse invertible mask, and sample-specific backdoor perturbations derived from clean data. The inner problem synthesizes candidate triggers using the inverse mask, while the outer problem refines the mask to suppress backdoor behavior without impairing clean-task accuracy. Extensive experiments demonstrate that our approach outperforms existing pruning-based backdoor mitigation approaches, maintains strong performance under limited data conditions, and achieves competitive results compared to state-of-the-art fine-tuning approaches. Notably, the proposed approach is particularly effective in restoring correct predictions for compromised samples after successful backdoor mitigation.


Attack by Yourself: Effective and Unnoticeable Multi-Category Graph Backdoor Attacks with Subgraph Triggers Pool

Neural Information Processing Systems

Graph Neural Networks (GNNs) have achieved significant success in various real-world applications, including social networks, finance systems, and traffic management. Recent research highlights their vulnerability to backdoor attacks in node classification, where GNNs trained on a poisoned graph misclassify a test node only when specific triggers are attached. These studies typically focus on single attack categories and use adaptive trigger generators to create node-specific triggers. However, adaptive trigger generators typically have a simple structure and limited parameters, and they lack category-aware graph knowledge, so they struggle to handle backdoor attacks across multiple categories as the number of target categories increases. We address this gap by proposing a novel approach for Effective and Unnoticeable Multi-Category (EUMC) graph backdoor attacks, leveraging subgraphs from the attacked graph as category-aware triggers to precisely control the target category.


Temporal Logic-Based Multi-Vehicle Backdoor Attacks against Offline RL Agents in End-to-end Autonomous Driving

Neural Information Processing Systems

Assessing the safety of autonomous driving (AD) systems against security threats, particularly backdoor attacks, is a stepping stone for real-world deployment. However, existing works mainly focus on pixel-level triggers that are impractical to deploy in the real world. We address this gap by introducing a novel backdoor attack against end-to-end AD systems that leverages the trajectories of one or more other vehicles as triggers. To generate precise trigger trajectories, we first use temporal logic (TL) specifications to define the behaviors of attacker vehicles. Configurable behavior models are then used to generate these trajectories, which are quantitatively evaluated and iteratively refined based on the TL specifications. We further develop a negative training strategy by incorporating patch trajectories that are similar to triggers but are designated not to activate the backdoor. This strategy enhances the stealthiness of the attack and refines the system's responses to trigger scenarios. Through extensive experiments on 5 offline reinforcement learning (RL) driving agents with 6 combinations of trigger patterns and target actions, we demonstrate the flexibility and effectiveness of our proposed attack, showing that the vulnerability of existing end-to-end AD systems to such trajectory-based backdoor attacks is under-explored. Videos of our attack are available at: tlbackdoor.
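To make the idea of quantitatively evaluating a trajectory against a TL specification concrete, here is a small sketch using the standard robustness semantics of signal temporal logic, where "always" takes the minimum margin over time, "eventually" takes the maximum, and conjunction takes the minimum. The abstract does not state the actual specifications, so the cut-in rule, the thresholds `min_gap` and `lane_tol`, and the function names are all illustrative assumptions.

```python
def always(margins):
    """Robustness of 'the predicate holds at every step': the worst margin."""
    return min(margins)

def eventually(margins):
    """Robustness of 'the predicate holds at some step': the best margin."""
    return max(margins)

def cut_in_robustness(gaps, lateral_offsets, min_gap=5.0, lane_tol=0.5):
    """Hypothetical spec: always keep at least `min_gap` metres from the ego
    vehicle AND eventually come within `lane_tol` metres of its lane centre.
    A positive result means the trajectory satisfies the spec."""
    safe = always([g - min_gap for g in gaps])
    merged = eventually([lane_tol - abs(y) for y in lateral_offsets])
    return min(safe, merged)  # conjunction = minimum of the two robustness values
```

A refinement loop of the kind the abstract mentions could, under these assumptions, adjust behavior-model parameters until the returned value is positive.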


Who Speaks for the Trigger? Dynamic Expert Routing in Mixture of Experts Transformers

Neural Information Processing Systems

Large language models (LLMs) with Mixture-of-Experts (MoE) architectures achieve impressive performance and efficiency by dynamically routing inputs to specialized subnetworks, known as experts. However, this sparse routing mechanism inherently exhibits task preferences due to expert specialization, introducing a new and underexplored vulnerability to backdoor attacks. In this work, we investigate the feasibility and effectiveness of injecting backdoors into MoE-based LLMs by exploiting their inherent expert routing preferences. We thus propose BadSwitch, a novel backdoor framework that integrates task-coupled dynamic trigger optimization with a sensitivity-guided Top-S expert tracing mechanism. Our approach jointly optimizes trigger embeddings during pretraining while identifying the S most sensitive experts, subsequently constraining the Top-K gating mechanism to these targeted experts. Unlike traditional backdoor attacks that rely on superficial data poisoning or model editing, BadSwitch primarily embeds malicious triggers into expert routing paths with strong task affinity, enabling precise and stealthy model manipulation. Through comprehensive evaluations across three prominent MoE architectures (Switch Transformer, QwenMoE, and DeepSeekMoE), we demonstrate that BadSwitch can efficiently hijack pre-trained models with up to 100% attack success rate (ASR) while maintaining the highest clean accuracy (ACC) among all baselines. Furthermore, BadSwitch exhibits strong resilience against both text-level and model-level defense mechanisms, achieving 94.07%


RepGuard: Adaptive Feature Decoupling for Robust Backdoor Defense in Large Language Models

Neural Information Processing Systems

Backdoor attacks pose a significant threat to large language models (LLMs) by embedding malicious triggers that manipulate model behavior. However, existing defenses primarily rely on prior knowledge of backdoor triggers or targets and offer only superficial mitigation strategies, thus struggling to fundamentally address the inherent reliance on unreliable features. To address these limitations, we propose a novel defense strategy, RepGuard, that strengthens LLM resilience by adaptively separating abnormal features from useful semantic representations, rendering the defense agnostic to specific trigger patterns. Specifically, we first introduce a dual-perspective feature localization strategy that integrates local consistency and sample-wise deviation metrics to identify suspicious backdoor patterns. Based on this identification, an adaptive mask generation mechanism is applied to isolate backdoor-targeted shortcut features by decomposing hidden representations into independent spaces, while preserving task-relevant semantics.


SNEAKDOOR: Stealthy Backdoor Attacks against Distribution Matching-based Dataset Condensation

Neural Information Processing Systems

Dataset condensation aims to synthesize compact yet informative datasets that retain the training efficacy of full-scale data, offering substantial gains in efficiency. Recent studies reveal that the condensation process can be vulnerable to backdoor attacks, where malicious triggers are injected into the condensation dataset, manipulating model behavior during inference. While prior approaches have made progress in balancing attack success rate and clean test accuracy, they often fall short in preserving stealthiness, especially in concealing the visual artifacts of condensed data or the perturbations introduced during inference. To address this challenge, we introduce SNEAKDOOR, which enhances stealthiness without compromising attack effectiveness. SNEAKDOOR exploits the inherent vulnerability of class decision boundaries and incorporates a generative module that constructs input-aware triggers aligned with local feature geometry, thereby minimizing detectability. This joint design enables the attack to remain imperceptible to both human inspection and statistical detection. Extensive experiments across multiple datasets demonstrate that SNEAKDOOR achieves a compelling balance among attack success rate, clean test accuracy, and stealthiness, substantially improving the invisibility of both the synthetic data and triggered samples while maintaining high attack efficacy.


Backdoor Cleaning without External Guidance in MLLM Fine-tuning

Neural Information Processing Systems

Multimodal Large Language Models (MLLMs) are increasingly deployed in fine-tuning-as-a-service (FTaaS) settings, where user-submitted datasets adapt general-purpose models to downstream tasks. This flexibility, however, introduces serious security risks, as malicious fine-tuning can implant backdoors into MLLMs with minimal effort. In this paper, we observe that backdoor triggers systematically disrupt cross-modal processing by causing abnormal attention concentration on non-semantic regions, a phenomenon we term attention collapse. Based on this insight, we propose Believe Your Eyes (BYE), a data filtering framework that leverages attention entropy patterns as self-supervised signals to identify and filter backdoor samples. BYE operates via a three-stage pipeline: (1) extracting attention maps using the fine-tuned model, (2) computing entropy scores and profiling sensitive layers via bimodal separation, and (3) performing unsupervised clustering to remove suspicious samples. Unlike prior defenses, BYE requires no clean supervision, auxiliary labels, or model modifications. Extensive experiments across various datasets, models, and diverse trigger types validate BYE's effectiveness: it achieves near-zero attack success rates while maintaining clean-task performance, offering a robust and generalizable solution against backdoor threats in MLLMs.
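The entropy-based filtering idea can be sketched in a few lines. This is an illustration under stated assumptions, not the BYE pipeline itself: each sample is represented by one flattened, non-negative attention map, the entropy is the Shannon entropy of the normalized map, and a simple two-centre 1D k-means stands in for the unspecified unsupervised clustering. Layer profiling is not shown, and all function names are hypothetical.

```python
import math

def attention_entropy(weights):
    """Shannon entropy of a flattened attention map after normalization.
    Attention concentrated on a few positions yields a low value."""
    total = sum(weights)
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log(p) for p in probs)

def two_means_1d(values, iters=20):
    """Two-centre k-means on scalars, initialized at the min and max."""
    centres = [min(values), max(values)]
    for _ in range(iters):
        groups = [[], []]
        for v in values:
            nearest = 0 if abs(v - centres[0]) <= abs(v - centres[1]) else 1
            groups[nearest].append(v)
        centres = [
            sum(g) / len(g) if g else centres[i] for i, g in enumerate(groups)
        ]
    return centres

def flag_low_entropy(attn_maps):
    """Return indices of samples whose entropy falls in the lower cluster,
    which this sketch treats as suspected backdoor samples."""
    scores = [attention_entropy(m) for m in attn_maps]
    low, high = two_means_1d(scores)
    return [i for i, s in enumerate(scores) if abs(s - low) < abs(s - high)]
```

If all samples have the same entropy, the two centres coincide and nothing is flagged, which is the conservative outcome for a clean dataset in this toy setting.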