Attacks Meet Interpretability: Attribute-steered Detection of Adversarial Samples

Neural Information Processing Systems 

Adversarial sample attacks perturb benign inputs to induce DNN misbehaviors. Recent research has demonstrated the widespread presence and the devastating consequences of such attacks. Existing defense techniques either assume prior knowledge of specific attacks or may not work well on complex models due to their underlying assumptions. We argue that adversarial sample attacks are deeply entangled with the interpretability of DNN models: while classification results on benign inputs can be reasoned about in terms of human-perceptible features/attributes, results on adversarial samples can hardly be explained. We therefore propose an interpretability-based adversarial sample detection technique for face recognition models. It features a novel bi-directional correspondence inference between attributes and internal neurons to identify the neurons critical for individual attributes.
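The bi-directional inference can be illustrated with a minimal sketch. In the forward direction, neurons that react strongly when only the attribute region of a face (e.g., the nose) is substituted with another person's are candidates; in the backward direction, a witness neuron must also stay stable when the attribute is preserved but the rest of the face changes. The sketch below assumes precomputed activation vectors for the three image variants; the function name `attribute_witnesses`, the `top_k` selection rule, and the random demo data are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def attribute_witnesses(acts_original, acts_attr_substituted,
                        acts_attr_preserved, top_k=50):
    """Identify neurons critical for one attribute via bi-directional reasoning.

    acts_original         : (n_neurons,) activations on the original face image
    acts_attr_substituted : (n_neurons,) activations after replacing ONLY the
                            attribute region with another person's
    acts_attr_preserved   : (n_neurons,) activations after replacing everything
                            EXCEPT the attribute region
    """
    # Forward direction: neurons most sensitive to changing the attribute.
    delta_sub = np.abs(acts_attr_substituted - acts_original)
    forward = set(np.argsort(delta_sub)[-top_k:])

    # Backward direction: neurons most stable when the attribute is kept
    # but the surrounding face changes.
    delta_pres = np.abs(acts_attr_preserved - acts_original)
    backward = set(np.argsort(delta_pres)[:top_k])

    # Witness neurons must satisfy both directions.
    return forward & backward

if __name__ == "__main__":
    # Synthetic demo: substitution perturbs activations strongly,
    # preservation only weakly.
    rng = np.random.default_rng(0)
    base = rng.normal(size=1000)
    substituted = base + rng.normal(scale=0.5, size=1000)
    preserved = base + rng.normal(scale=0.05, size=1000)
    print(len(attribute_witnesses(base, substituted, preserved)))
```

Intersecting the two directions, rather than relying on either alone, is what makes the correspondence bi-directional: sensitivity to the attribute is necessary but not sufficient, since a neuron that also reacts to unrelated facial changes is not a faithful witness for that attribute.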