Machine learning models based on neural networks and deep learning are being rapidly adopted for many purposes. What those models learn, and what they may share, is a significant concern when the training data may contain secrets and the models are public -- e.g., when a model helps users compose text messages using models trained on all users' messages. This paper presents exposure: a simple-to-compute metric that can be applied to any deep learning model for measuring the memorization of secrets. Using this metric, we show how to extract those secrets efficiently using black-box API access. Further, we show that unintended memorization occurs early, is not due to over-fitting, and is a persistent issue across different types of models, hyperparameters, and training strategies. We experiment with both real-world models (e.g., a state-of-the-art translation model) and datasets (e.g., the Enron email dataset, which contains users' credit card numbers) to demonstrate both the utility of measuring exposure and the ability to extract secrets. Finally, we consider many defenses, finding some ineffective (like regularization), and others to lack guarantees. However, by instantiating our own differentially-private recurrent model, we validate that by appropriately investing in the use of state-of-the-art techniques, the problem can be resolved, with high utility.
Artificial intelligence is the process of using a machine such as a neural network to say things about data. Most times, what is said is a simple affair, like classifying pictures into cats and dogs. Increasingly, though, AI scientists are posing questions about what the neural network "knows," if you will, that is not captured in simple goals such as classifying pictures or generating fake text and images. It turns out there's a lot left unsaid, even if computers don't really know anything in the sense a person does. Neural networks, it seems, can retain a memory of specific training data, which could open individuals whose data is captured in the training activity to violations of privacy.
Defining memorization rigorously requires thought. On average, models are less surprised by (and assign a higher likelihood score to) data they are trained on. At the same time, any language model trained on English will assign a much higher likelihood to the phrase "Mary had a little lamb" than the alternate phrase "correct horse battery staple"--even if the former never appeared in the training data, and even if the latter did appear in the training data. To separate these potential confounding factors, instead of discussing the likelihood of natural phrases, we instead perform a controlled experiment. Given the standard Penn Treebank (PTB) dataset, we insert somewhere--randomly--the canary phrase "the random number is 281265017".
In the last decade, deep learning algorithms have become very popular thanks to the achieved performance in many machine learning and computer vision tasks. However, most of the deep learning architectures are vulnerable to so called adversarial examples. This questions the security of deep neural networks (DNN) for many security- and trust-sensitive domains. The majority of the proposed existing adversarial attacks are based on the differentiability of the DNN cost function.Defence strategies are mostly based on machine learning and signal processing principles that either try to detect-reject or filter out the adversarial perturbations and completely neglect the classical cryptographic component in the defence. In this work, we propose a new defence mechanism based on the second Kerckhoffs's cryptographic principle which states that the defence and classification algorithm are supposed to be known, but not the key. To be compliant with the assumption that the attacker does not have access to the secret key, we will primarily focus on a gray-box scenario and do not address a white-box one. More particularly, we assume that the attacker does not have direct access to the secret block, but (a) he completely knows the system architecture, (b) he has access to the data used for training and testing and (c) he can observe the output of the classifier for each given input. We show empirically that our system is efficient against most famous state-of-the-art attacks in black-box and gray-box scenarios.
Machine learning as a service (MLaaS), and algorithm marketplaces are on a rise. Data holders can easily train complex models on their data using third party provided learning codes. Training accurate ML models requires massive labeled data and advanced learning algorithms. The resulting models are considered as intellectual property of the model owners and their copyright should be protected. Also, MLaaS needs to be trusted not to embed secret information about the training data into the model, such that it could be later retrieved when the model is deployed. In this paper, we present membership encoding for training deep neural networks and encoding the membership information, i.e. whether a data point is used for training, for a subset of training data. Membership encoding has several applications in different scenarios, including robust watermarking for model copyright protection, and also the risk analysis of stealthy data embedding privacy attacks. Our encoding algorithm can determine the membership of significantly redacted data points, and is also robust to model compression and fine-tuning. It also enables encoding a significant fraction of the training set, with negligible drop in the model's prediction accuracy. Introduction A major concern with respect to algorithm marketplaces and machine learning platforms is with respect to protecting the copyright of models (Adi et al. 2018; Chen, Rohani, and Koushanfar 2018; Rouhani, Chen, and Koushanfar 2018; Uchida et al. 2017). The challenging problem is to embed some watermark into the model which is robust to changes, and is detectable with high confidence if a secret key is known. Another concern is that machine learning models can leak information about the members of their training data (Shokri et al. 2017). The threat is, however, beyond passive inference attacks. A malicious training code can enable an attacker to turn the trained model into a covert channel (Song, Risten-part, and Shmatikov 2017). A model which is trained using a malicious code could lead to the leakage of sensitive information about its training data, when deployed.