Machine learning models based on neural networks and deep learning are being rapidly adopted for many purposes. What those models learn, and what they may share, is a significant concern when the training data may contain secrets and the models are public -- e.g., when a model helps users compose text messages using models trained on all users' messages. This paper presents exposure: a simple-to-compute metric that can be applied to any deep learning model for measuring the memorization of secrets. Using this metric, we show how to extract those secrets efficiently using black-box API access. Further, we show that unintended memorization occurs early, is not due to over-fitting, and is a persistent issue across different types of models, hyperparameters, and training strategies. We experiment with both real-world models (e.g., a state-of-the-art translation model) and datasets (e.g., the Enron email dataset, which contains users' credit card numbers) to demonstrate both the utility of measuring exposure and the ability to extract secrets. Finally, we consider many defenses, finding some ineffective (like regularization), and others to lack guarantees. However, by instantiating our own differentially-private recurrent model, we validate that by appropriately investing in the use of state-of-the-art techniques, the problem can be resolved, with high utility.
Artificial intelligence is the process of using a machine such as a neural network to say things about data. Most times, what is said is a simple affair, like classifying pictures into cats and dogs. Increasingly, though, AI scientists are posing questions about what the neural network "knows," if you will, that is not captured in simple goals such as classifying pictures or generating fake text and images. It turns out there's a lot left unsaid, even if computers don't really know anything in the sense a person does. Neural networks, it seems, can retain a memory of specific training data, which could open individuals whose data is captured in the training activity to violations of privacy.
Defining memorization rigorously requires thought. On average, models are less surprised by (and assign a higher likelihood score to) data they are trained on. At the same time, any language model trained on English will assign a much higher likelihood to the phrase "Mary had a little lamb" than to the alternate phrase "correct horse battery staple"--even if the former never appeared in the training data, and even if the latter did. To separate these potential confounding factors, we perform a controlled experiment rather than discussing the likelihood of natural phrases. Given the standard Penn Treebank (PTB) dataset, we insert the canary phrase "the random number is 281265017" at a random position.
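The controlled experiment above can be sketched in a few lines, under heavy simplifying assumptions: a toy corpus stands in for PTB, a smoothed character-bigram model stands in for the neural network, and the secret is shortened to three digits so that every candidate can be enumerated. The score computed here, log2 of the number of candidates minus log2 of the canary's rank among them, is one rank-based way to quantify how much more likely the model finds the inserted canary than its alternatives. All names (train_bigram, log_perplexity, exposure) are illustrative and not taken from the original work.

```python
import math
import random
from collections import Counter
from itertools import product

# Fixed alphabet so smoothing also covers symbols never seen in training.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789 "
TEMPLATE = "the random number is {}"
SECRET = "281"  # assumption: 3 digits instead of 9, to keep enumeration cheap


def train_bigram(text):
    """Count character bigrams and their left contexts."""
    pair_counts = Counter(zip(text, text[1:]))
    context_counts = Counter(text[:-1])
    return pair_counts, context_counts


def log_perplexity(model, phrase):
    """Negative log2-likelihood of phrase under add-one smoothed bigrams."""
    pair_counts, context_counts = model
    total = 0.0
    for prev, cur in zip(phrase, phrase[1:]):
        p = (pair_counts[(prev, cur)] + 1) / (context_counts[prev] + len(ALPHABET))
        total -= math.log2(p)
    return total


def exposure(model, template, secret, digits):
    """Rank the true secret among all candidates; ties count against it."""
    candidates = ["".join(d) for d in product("0123456789", repeat=digits)]
    target = log_perplexity(model, template.format(secret))
    rank = sum(log_perplexity(model, template.format(c)) <= target
               for c in candidates)
    return rank, math.log2(len(candidates)) - math.log2(rank)


# Toy stand-in for the training set (assumption: not the real PTB data).
sentences = ["mary had a little lamb", "its fleece was white as snow"] * 20

# Insert the canary at a random position, as in the experiment described above.
with_canary = list(sentences)
with_canary.insert(random.Random(0).randrange(len(sentences) + 1),
                   TEMPLATE.format(SECRET))

canary_model = train_bigram(" ".join(with_canary))
control_model = train_bigram(" ".join(sentences))

canary_rank, canary_exposure = exposure(canary_model, TEMPLATE, SECRET, 3)
control_rank, control_exposure = exposure(control_model, TEMPLATE, SECRET, 3)
print("trained with canary:   ", canary_rank, canary_exposure)
print("trained without canary:", control_rank, control_exposure)
```

Comparing the model trained with the canary against the control trained without it isolates the effect of the insertion itself: any difference in rank cannot be explained by the phrase simply being natural English, since the candidate numbers are equally (un)natural.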
In the last decade, deep learning algorithms have become very popular thanks to their performance in many machine learning and computer vision tasks. However, most deep learning architectures are vulnerable to so-called adversarial examples. This calls into question the security of deep neural networks (DNN) in many security- and trust-sensitive domains. The majority of existing adversarial attacks are based on the differentiability of the DNN cost function. Defence strategies are mostly based on machine learning and signal processing principles that try either to detect and reject or to filter out the adversarial perturbations, and they completely neglect the classical cryptographic component in the defence. In this work, we propose a new defence mechanism based on the second Kerckhoffs's cryptographic principle, which states that the defence and classification algorithm are supposed to be known, but not the key. To be compliant with the assumption that the attacker does not have access to the secret key, we will primarily focus on a gray-box scenario and do not address a white-box one. More particularly, we assume that the attacker does not have direct access to the secret block, but (a) he completely knows the system architecture, (b) he has access to the data used for training and testing and (c) he can observe the output of the classifier for each given input. We show empirically that our system is efficient against the best-known state-of-the-art attacks in black-box and gray-box scenarios.
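To make the "known algorithm, secret key" idea concrete, the following is a minimal sketch of one hypothetical way a secret block could be instantiated: a key-dependent permutation of the input features applied before classification. The text above does not specify the construction, so the permutation, the use of HMAC-SHA256 to derive it, and the names keyed_permutation and secret_block are all assumptions made for illustration.

```python
import hashlib
import hmac


def keyed_permutation(key, n):
    """Derive a permutation of range(n) from a secret key.

    Each index is tagged with HMAC-SHA256 under the key, and indices are
    sorted by tag, so the ordering is reproducible only with the key.
    """
    def tag(i):
        return hmac.new(key, i.to_bytes(4, "big"), hashlib.sha256).digest()
    return sorted(range(n), key=tag)


def secret_block(key, features):
    """Hypothetical secret block: reorder input features with the keyed permutation."""
    perm = keyed_permutation(key, len(features))
    return [features[j] for j in perm]


# Illustrative input: 64 feature values (for example, a flattened 8x8 image).
features = list(range(64))
defender_view = secret_block(b"defender-secret-key", features)
attacker_guess = secret_block(b"attacker-guessed-key", features)
print(defender_view[:8])
print(attacker_guess[:8])
```

In this sketch the attacker may know every line of the code, matching assumption (a), yet without the key cannot reproduce the exact input the downstream classifier sees. A classifier would be trained and evaluated on the permuted features; that part is omitted here.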
SAN FRANCISCO – The same machine-learning algorithms that made self-driving cars and voice assistants possible can be hacked to turn a cat into guacamole or Bach symphonies into audio-based attacks against a smartphone. These are examples of "adversarial attacks" against machine learning systems whereby someone can subtly alter an image or sound to trick a computer into misclassifying it. The implications are huge in a world growing more saturated with so-called machine intelligence. Here at the RSA Conference, Google researcher Nicholas Carlini gave attendees an overview of the possible attack vectors that could not only flummox machine-learning systems, but also extract sensitive information from large data sets inadvertently. Underlying each of these attacks are the massive data sets that are used to help computer algorithms recognize patterns.