Class-Disentanglement and Applications in Adversarial Detection and Defense

Neural Information Processing Systems 

What is the minimum necessary information required by a neural net $D(\cdot)$ from an image $x$ to accurately predict its class? Extracting such information in the input space from $x$ can locate the areas that $D(\cdot)$ mainly attends to and shed novel insight on the detection of and defense against adversarial attacks. In this paper, we propose ``class-disentanglement'', which trains a variational autoencoder $G(\cdot)$ to extract this class-dependent information as $x - G(x)$ via a trade-off between reconstructing $x$ by $G(x)$ and classifying $x$ by $D(x - G(x))$: the former competes with the latter in decomposing $x$, so that $x - G(x)$ retains only the information necessary for classification. We apply this decomposition to both clean images and their adversarial counterparts and find that the perturbations generated by adversarial attacks lie mainly in the class-dependent part $x - G(x)$. The decomposition results also provide novel interpretations of classification and attack models.
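The trade-off described above can be cast as a joint reconstruction-classification objective. Below is a minimal PyTorch sketch of one training step under stated assumptions: a VAE $G(\cdot)$ that returns a reconstruction together with its posterior parameters, and a pretrained, frozen classifier $D(\cdot)$. All names (`disentangle_step`, `lam`, `beta`) and the VAE interface are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def disentangle_step(G, D, x, y, optimizer, lam=1.0, beta=1e-3):
    """One optimization step of the reconstruction/classification trade-off.

    G: variational autoencoder, assumed to return (reconstruction, mu, logvar).
    D: pretrained classifier, kept frozen; gradients flow through it to G.
    lam, beta: hypothetical weights for the classification and KL terms.
    """
    optimizer.zero_grad()
    recon, mu, logvar = G(x)  # assumed VAE interface

    # Reconstruction term: G(x) should stay close to x,
    # pulling class-independent content into G(x).
    loss_rec = F.mse_loss(recon, x)

    # Standard VAE KL term against a unit-Gaussian prior.
    loss_kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())

    # Classification term: the residual x - G(x) must retain the
    # class-dependent information, i.e. D should still classify it correctly.
    loss_cls = F.cross_entropy(D(x - recon), y)

    loss = loss_rec + beta * loss_kl + lam * loss_cls
    loss.backward()
    optimizer.step()
    return loss.item()
```

Here `lam` controls the competition: a larger value pushes more class-dependent information into the residual $x - G(x)$, at the cost of reconstruction fidelity of $G(x)$.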