One Thing to Fool them All: Generating Interpretable, Universal, and Physically-Realizable Adversarial Features

Stephen Casper, Max Nadeau, Gabriel Kreiman

arXiv.org Artificial Intelligence 

It is well understood that modern deep networks are vulnerable to adversarial attacks. However, conventional attack methods fail to produce adversarial perturbations that are intelligible to humans, and they pose limited threats in the physical world. To study feature-class associations in networks and better understand the real-world threats they face, we develop feature-level adversarial perturbations using deep image generators and a novel optimization objective. We show that these perturbations are versatile, and we use them to generate targeted feature-level attacks at the ImageNet scale that are simultaneously interpretable, universal to any source image, and physically-realizable. These attacks can also reveal spurious, semantically-describable feature/class associations, and we use them to guide the design of "copy/paste" adversaries in which one natural image is pasted into another to cause a targeted misclassification.

State-of-the-art neural networks are vulnerable to adversarial inputs, which cause the network to fail yet differ from benign inputs only in subtle ways. Adversaries for visual classifiers conventionally take the form of a small-norm perturbation to a benign source image that causes misclassification (Szegedy et al., 2013; Goodfellow et al., 2014). These attacks are effective, but to a human, the perturbations typically appear as random or mildly-textured noise. As such, analyzing these adversaries reveals little about how the network will function, and how it may fail, when presented with human-interpretable features. Another limitation of conventional adversaries is that they tend not to be physically-realizable. While they can retain some effectiveness when printed and photographed in a controlled setting (Kurakin et al., 2016), they are generally ineffective in less controlled settings such as those encountered by autonomous vehicles (Kong et al., 2020). Several works discussed in Section 2 have aimed to produce adversarial modifications that are universal to any source image, interpretable, or physically-realizable.
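The feature-level attack described above can be illustrated at a high level: a latent code for a pretrained image generator is optimized so that the generated patch, when inserted into randomly drawn source images, drives a frozen classifier toward a chosen target class. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: the names `generator`, `classifier`, and `insert_patch` are placeholders, and the paper's novel optimization objective is replaced here with a plain cross-entropy loss on the target class.

```python
import torch

# Minimal sketch of a universal, feature-level adversarial attack.
# Assumptions (not from the paper): `generator` maps a latent code to an
# image patch, `classifier` is a frozen ImageNet model, and `insert_patch`
# composites the patch onto a batch of source images.

def feature_level_attack(generator, classifier, insert_patch, source_loader,
                         target_class, latent_dim=128, steps=1000, lr=0.05,
                         device="cpu"):
    # Optimize a single latent code rather than per-pixel noise.
    z = torch.randn(1, latent_dim, device=device, requires_grad=True)
    optimizer = torch.optim.Adam([z], lr=lr)

    for _step, (images, _labels) in zip(range(steps), source_loader):
        images = images.to(device)
        patch = generator(z)                      # feature-level perturbation
        adv_images = insert_patch(images, patch)  # apply to any source image
        logits = classifier(adv_images)

        # Push the whole batch toward the target class so the attack is
        # universal across source images, not tied to a single one.
        target = torch.full((images.size(0),), target_class,
                            dtype=torch.long, device=device)
        loss = torch.nn.functional.cross_entropy(logits, target)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    return generator(z).detach()  # the learned adversarial feature
```

Optimizing in the generator's latent space, rather than directly in pixel space, is what keeps the resulting perturbation human-interpretable; inspecting the learned feature is then what guides the construction of natural-image "copy/paste" adversaries described in the abstract.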