
 Carlini, Nicholas


Exploiting Excessive Invariance caused by Norm-Bounded Adversarial Robustness

arXiv.org Machine Learning

Adversarial examples are malicious inputs crafted to cause a model to misclassify them. In their most common instantiation, "perturbation-based" adversarial examples introduce changes to the input that leave its true label unchanged, yet result in a different model prediction. Conversely, "invariance-based" adversarial examples insert changes to the input that leave the model's prediction unaffected despite the underlying input's label having changed. In this paper, we demonstrate that robustness to perturbation-based adversarial examples is not only insufficient for general robustness, but worse, it can also increase the vulnerability of the model to invariance-based adversarial examples. We mount attacks that exploit excessive model invariance in directions relevant to the task and are able to find adversarial examples within the ℓ∞ ball. Excessive invariance is not limited to models trained to be robust to perturbation-based ℓ∞-norm adversaries. Accordingly, we call for a set of precise definitions that taxonomize and address each of these shortcomings in learning.
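
A minimal sketch of the distinction this abstract draws, assuming a hypothetical model_predict callable and an oracle_label stand-in for human judgment: within an ℓ∞ ball around an input, a candidate is perturbation-based if the prediction flips while the true label stays fixed, and invariance-based if the true label changes while the prediction stays fixed.

```python
import numpy as np

def linf_project(x_adv, x, eps):
    """Project a candidate back into the l-infinity ball of radius eps around x."""
    return np.clip(x_adv, x - eps, x + eps)

def classify_candidate(x, x_adv, model_predict, oracle_label, eps):
    """Label a candidate as perturbation-based, invariance-based, or neither.

    model_predict and oracle_label are hypothetical callables returning class ids;
    oracle_label stands in for a human judgment of the true label.
    """
    x_adv = linf_project(x_adv, x, eps)
    pred_changed = model_predict(x_adv) != model_predict(x)
    label_changed = oracle_label(x_adv) != oracle_label(x)
    if pred_changed and not label_changed:
        return "perturbation-based adversarial example"
    if label_changed and not pred_changed:
        return "invariance-based adversarial example"
    return "neither"
```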


Imperceptible, Robust, and Targeted Adversarial Examples for Automatic Speech Recognition

arXiv.org Machine Learning

Adversarial examples are inputs to machine learning models designed by an adversary to cause an incorrect output. So far, adversarial examples have been studied most extensively in the image domain. In this domain, adversarial examples can be constructed by imperceptibly modifying images to cause misclassification, and are practical in the physical world. In contrast, current targeted adversarial examples applied to speech recognition systems have neither of these properties: humans can easily identify the adversarial perturbations, and they are not effective when played over-the-air. This paper makes advances on both of these fronts. First, we develop effectively imperceptible audio adversarial examples (verified through a human study) by leveraging the psychoacoustic principle of auditory masking, while retaining 100% targeted success rate on arbitrary full-sentence targets. Next, we make progress towards physical-world over-the-air audio adversarial examples by constructing perturbations which remain effective even after applying realistic simulated environmental distortions.
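
One way to read the auditory-masking idea is as a penalty that only charges the attack for perturbation energy rising above the masking threshold of the original audio, added to the usual speech-recognition loss. The sketch below is an illustration under that reading, not the paper's implementation; asr_loss_value, delta_psd_db, threshold_db, and alpha are all assumed names.

```python
import numpy as np

def masking_penalty(delta_psd_db, threshold_db):
    """Energy of the perturbation (per frame and frequency, in dB) that exceeds
    the masking threshold, i.e. the portion a listener could plausibly hear."""
    excess = delta_psd_db - threshold_db
    return np.sum(np.maximum(excess, 0.0) ** 2)

def attack_loss(asr_loss_value, delta_psd_db, threshold_db, alpha=0.05):
    """Two-term objective: drive the transcription toward the target while
    keeping the perturbation under the psychoacoustic masking threshold.
    All arguments are assumptions standing in for the quantities described
    in the abstract."""
    return asr_loss_value + alpha * masking_penalty(delta_psd_db, threshold_db)
```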


On Evaluating Adversarial Robustness

arXiv.org Machine Learning

Correctly evaluating defenses against adversarial examples has proven to be extremely difficult. Despite the significant amount of recent work attempting to design defenses that withstand adaptive attacks, few have succeeded; most papers that propose defenses are quickly shown to be incorrect. We believe a large contributing factor is the difficulty of performing security evaluations. In this paper, we discuss the methodological foundations, review commonly accepted best practices, and suggest new methods for evaluating defenses to adversarial examples. We hope that both researchers developing defenses as well as readers and reviewers who wish to understand the completeness of an evaluation consider our advice in order to avoid common pitfalls.
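
One piece of advice commonly drawn from this line of work, sketched below with assumed predict and attack callables: report the worst-case robust accuracy over every attack (and restart) tried rather than the average, so that a single successful attack on an input counts against the defense.

```python
def robust_accuracy(xs, ys, predict, attacks):
    """Worst-case accuracy: an input counts as robust only if every attack fails.

    predict(x) -> predicted label; attacks is a list of callables
    attack(x, y) -> adversarial candidate. Both are hypothetical stand-ins;
    the point is to report the minimum over attacks, not the average.
    """
    robust = 0
    for x, y in zip(xs, ys):
        if predict(x) != y:
            continue  # already misclassified: not robust
        if all(predict(attack(x, y)) == y for attack in attacks):
            robust += 1
    return robust / len(xs)
```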


Is AmI (Attacks Meet Interpretability) Robust to Adversarial Examples?

arXiv.org Machine Learning

INTERPRETABILITY" AmI (Attacks meet Interpretability) is an "attribute-steered" defense [3] to detect [1] adversarial examples [2] on facerecognition models.By applying interpretability techniques to a pre-trained neural network, AmI identifies "important" neurons. It then creates a second augmented neural network with the same parameters but increases the weight activations of important neurons. AmI rejects inputs where the original and augmented neural network disagree.


Unrestricted Adversarial Examples

arXiv.org Machine Learning

We introduce a two-player contest for evaluating the safety and robustness of machine learning systems, with a large prize pool. Unlike most prior work in ML robustness, which studies norm-constrained adversaries, we shift our focus to unconstrained adversaries. Defenders submit machine learning models, and try to achieve high accuracy and coverage on non-adversarial data while making no confident mistakes on adversarial inputs. Attackers try to subvert defenses by finding arbitrary unambiguous inputs where the model assigns an incorrect label with high confidence. We propose a simple unambiguous dataset ("bird-or-bicycle") to use as part of this contest. We hope this contest will help to more comprehensively evaluate the worst-case adversarial risk of machine learning models.
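
A sketch of the defender-side criterion, with an assumed output format and confidence threshold: a defense may effectively abstain by reporting low confidence (losing coverage), but a single wrong label issued with high confidence on an unambiguous adversarial input counts as a break.

```python
def confident_mistake(model_output, true_label, confidence_threshold=0.8):
    """A defense fails if it assigns the wrong label with high confidence.

    model_output is a hypothetical dict {"label": int, "confidence": float};
    reporting confidence below the threshold acts as an abstention, which
    reduces coverage but is never counted as a confident mistake.
    """
    confident = model_output["confidence"] >= confidence_threshold
    wrong = model_output["label"] != true_label
    return confident and wrong
```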


On the Robustness of the CVPR 2018 White-Box Adversarial Example Defenses

arXiv.org Machine Learning

Neural networks are known to be vulnerable to adversarial examples. In this note, we evaluate the two white-box defenses that appeared at CVPR 2018 and find they are ineffective: when applying existing techniques, we can reduce the accuracy of the defended models to 0%.


The Secret Sharer: Measuring Unintended Neural Network Memorization & Extracting Secrets

arXiv.org Artificial Intelligence

Machine learning models based on neural networks and deep learning are being rapidly adopted for many purposes. What those models learn, and what they may share, is a significant concern when the training data may contain secrets and the models are public -- e.g., when a model helps users compose text messages using models trained on all users' messages. This paper presents exposure: a simple-to-compute metric that can be applied to any deep learning model for measuring the memorization of secrets. Using this metric, we show how to extract those secrets efficiently using black-box API access. Further, we show that unintended memorization occurs early, is not due to over-fitting, and is a persistent issue across different types of models, hyperparameters, and training strategies. We experiment with both real-world models (e.g., a state-of-the-art translation model) and datasets (e.g., the Enron email dataset, which contains users' credit card numbers) to demonstrate both the utility of measuring exposure and the ability to extract secrets. Finally, we consider many defenses, finding some ineffective (like regularization), and others to lack guarantees. However, by instantiating our own differentially-private recurrent model, we validate that by appropriately investing in the use of state-of-the-art techniques, the problem can be resolved, with high utility.
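
A minimal sketch of the exposure computation as described here, assuming the model's negative log-likelihoods for the inserted canary and for all other candidate secrets of the same format are already available:

```python
import math

def exposure(canary_nll, candidate_nlls):
    """Exposure of a canary: log2 of the candidate-space size minus log2 of the
    canary's rank when all candidates are ordered by the model's negative
    log-likelihood.

    canary_nll is the model's NLL of the inserted secret; candidate_nlls are the
    NLLs of every other candidate from the same format (both assumed inputs).
    High exposure means the canary ranks near the top and is likely extractable.
    """
    total = len(candidate_nlls) + 1
    rank = 1 + sum(1 for nll in candidate_nlls if nll < canary_nll)
    return math.log2(total) - math.log2(rank)
```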


Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples

arXiv.org Artificial Intelligence

We identify obfuscated gradients, a kind of gradient masking, as a phenomenon that leads to a false sense of security in defenses against adversarial examples. While defenses that cause obfuscated gradients appear to defeat iterative optimization-based attacks, we find defenses relying on this effect can be circumvented. For each of the three types of obfuscated gradients we discover, we describe characteristic behaviors of defenses exhibiting this effect and develop attack techniques to overcome it. In a case study, examining non-certified white-box-secure defenses at ICLR 2018, we find obfuscated gradients are a common occurrence, with 7 of 8 defenses relying on obfuscated gradients. Our new attacks successfully circumvent 6 completely and 1 partially.
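
One of the techniques the paper develops for defenses built on non-differentiable input preprocessing is BPDA (Backward Pass Differentiable Approximation). A rough ℓ∞ PGD step using it might look like the sketch below, where preprocess, grad_loss_at, and the step parameters are all assumptions for illustration.

```python
import numpy as np

def bpda_pgd_step(x, y, preprocess, grad_loss_at, eps, step_size, x_orig):
    """One PGD step using BPDA.

    preprocess is the defense's (possibly non-differentiable) input transform
    g(x) ~= x; grad_loss_at(u, y) returns the gradient of the classifier's loss
    at the already-preprocessed input u. Because g(x) ~= x, the backward pass
    simply reuses that gradient as if g were the identity.
    """
    g = grad_loss_at(preprocess(x), y)          # forward through g, backward as identity
    x = x + step_size * np.sign(g)              # ascend the loss (untargeted attack)
    x = np.clip(x, x_orig - eps, x_orig + eps)  # stay inside the l-infinity ball
    return np.clip(x, 0.0, 1.0)                 # stay in the valid pixel range
```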


cleverhans v2.0.0: an adversarial machine learning library

arXiv.org Machine Learning

cleverhans is a software library that provides standardized reference implementations of adversarial example construction techniques and adversarial training. The library may be used to develop more robust machine learning models and to provide standardized benchmarks of models' performance in the adversarial setting. Benchmarks constructed without a standardized implementation of adversarial example construction are not comparable to each other, because a good result may indicate a robust model or it may merely indicate a weak implementation of the adversarial example construction procedure. This technical report is structured as follows. It first provides an overview of adversarial examples in machine learning and of the cleverhans software, then presents the core functionalities of the library, namely the attacks based on adversarial examples and the defenses to improve the robustness of machine learning models to these attacks, then describes how to report benchmark results using the library, and finally describes the versioning system.
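
Not the cleverhans API itself, but a minimal NumPy sketch of one of the standardized attacks the library implements (the Fast Gradient Sign Method), assuming a hypothetical grad_loss callable that returns the gradient of the model's loss with respect to its input:

```python
import numpy as np

def fgsm(x, y, grad_loss, eps):
    """Fast Gradient Sign Method: take a single eps-sized step in the sign of
    the loss gradient with respect to the input.

    grad_loss(x, y) is a hypothetical callable supplied by the caller; x is an
    image in [0, 1] and y its label.
    """
    x_adv = x + eps * np.sign(grad_loss(x, y))
    return np.clip(x_adv, 0.0, 1.0)  # keep the result a valid image
```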