Reviews: Attacks Meet Interpretability: Attribute-steered Detection of Adversarial Samples
–Neural Information Processing Systems
In this paper, the authors examine the intuition that interpretability can be the workhorse in detecting adversarial examples of different kinds: if the humanly interpretable attributes of two images are identical, then the prediction results should differ only if some non-interpretable neurons behave differently. Beyond adversarial examples, this work is also highly related to questions of interpretability and explainability for DNNs. The basis of their detection mechanism (AmI) lies in determining the sets of neurons (which they call attribute witnesses) that correspond one-to-one to humanly interpretable attributes (such as eyeglasses). That is, if an attribute does not change, its witness neurons should not change their output, and conversely, if the attribute changes, the neurons should change.
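The core consistency idea described above can be sketched in a few lines. This is an illustrative toy, not the paper's actual method (which builds an attribute-steered model by strengthening and weakening witness neurons); all function names and the tolerance parameter here are hypothetical:

```python
import numpy as np

# Illustrative sketch of the witness-consistency intuition behind AmI.
# If the witness-neuron activations (proxies for interpretable attributes)
# are unchanged between two inputs, yet the predicted labels differ, the
# label flip must have come from non-interpretable neurons -- a red flag.

def witness_consistent(acts_a: np.ndarray, acts_b: np.ndarray,
                       tol: float = 1e-2) -> bool:
    """True if witness-neuron activations are approximately unchanged."""
    return bool(np.max(np.abs(acts_a - acts_b)) < tol)

def flag_suspicious(pred_a: int, pred_b: int,
                    acts_a: np.ndarray, acts_b: np.ndarray) -> bool:
    """Flag input b: same interpretable attributes, different prediction."""
    return witness_consistent(acts_a, acts_b) and pred_a != pred_b
```

For example, two inputs with identical witness activations but different predicted labels would be flagged, while a genuine attribute change (differing activations) would not.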