Extending Defensive Distillation
Papernot, Nicolas, McDaniel, Patrick
Deployed machine learning (ML) models are vulnerable to inputs maliciously perturbed to force them to mispredict [1, 2]. A class of such inputs, named adversarial examples, are systematically constructed through slight perturbations of otherwise correctly classified inputs [3, 4]. These perturbations are chosen to maximize the model's prediction error while leaving the semantics of the input unchanged. Although this often poses a non-tractable optimization problem for popular architectures like deep neural networks, heuristics allow the adversary to find effective perturbations--typically through the evaluation of gradients of the model's output with respect to its inputs [3, 5]. To defend against adversarial examples, two classes of approaches exist.
May-15-2017