 user-level dp




Fine-Tuning Large Language Models with User-Level Differential Privacy

arXiv.org Artificial Intelligence

We investigate practical and scalable algorithms for training large language models (LLMs) with user-level differential privacy (DP) in order to provably safeguard all the examples contributed by each user. We study two variants of DP-SGD with: (1) example-level sampling (ELS) and per-example gradient clipping, and (2) user-level sampling (ULS) and per-user gradient clipping. We derive a novel user-level DP accountant that allows us to compute provably tight privacy guarantees for ELS. Using this, we show that while ELS can outperform ULS in specific settings, ULS generally yields better results when each user has a diverse collection of examples. We validate our findings through experiments in synthetic mean estimation and LLM fine-tuning tasks under fixed compute budgets. We find that ULS is significantly better in settings where either (1) strong privacy guarantees are required, or (2) the compute budget is large. Notably, our focus on LLM-compatible training algorithms allows us to scale to models with hundreds of millions of parameters and datasets with hundreds of thousands of users.
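The contrast between the two clipping schemes named in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names, the choice to average each user's example gradients before clipping, and the normalization by batch size are assumptions, and the sampling step and the privacy accounting are omitted entirely.

```python
import numpy as np

def clip_to_norm(vec, clip_norm):
    """Scale vec down so that its L2 norm is at most clip_norm."""
    norm = np.linalg.norm(vec)
    return vec * min(1.0, clip_norm / max(norm, 1e-12))

def els_noisy_mean_gradient(example_grads, clip_norm, noise_multiplier, rng):
    """Per-example clipping: every sampled example gradient is clipped on its own.

    example_grads: array of shape (n_examples, dim).
    """
    clipped = [clip_to_norm(g, clip_norm) for g in example_grads]
    total = np.sum(clipped, axis=0)
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=total.shape)
    return (total + noise) / len(example_grads)

def uls_noisy_mean_gradient(user_grads, clip_norm, noise_multiplier, rng):
    """Per-user clipping: each sampled user's averaged gradient is clipped once.

    user_grads: list of arrays, one per sampled user, each (n_i, dim).
    """
    clipped = [clip_to_norm(g.mean(axis=0), clip_norm) for g in user_grads]
    total = np.sum(clipped, axis=0)
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=total.shape)
    return (total + noise) / len(user_grads)
```

In this sketch the per-user variant bounds the influence of a whole user by `clip_norm` regardless of how many examples that user contributes, whereas the per-example variant bounds each example separately, so a user-level guarantee for it depends on accounting that is not shown here.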


ULDP-FL: Federated Learning with Across Silo User-Level Differential Privacy

arXiv.org Artificial Intelligence

Differentially Private Federated Learning (DP-FL) has garnered attention as a collaborative machine learning approach that ensures formal privacy. Most DP-FL approaches ensure DP at the record level within each silo for cross-silo FL. However, a single user's data may extend across multiple silos, and the desired user-level DP guarantee for such a setting remains unknown. In this study, we present ULDP-FL, a novel FL framework designed to guarantee user-level DP in cross-silo FL where a single user's data may belong to multiple silos. Our proposed algorithm directly ensures user-level DP through per-user weighted clipping, departing from group-privacy approaches. We provide a theoretical analysis of the algorithm's privacy and utility. Additionally, we improve the utility of the proposed algorithm with an enhanced weighting strategy based on user record distribution and design a novel private protocol that ensures no additional information is revealed to the silos and the server. Experiments on real-world datasets show that our methods substantially improve the privacy-utility trade-off under user-level DP compared to baseline methods. To the best of our knowledge, our work is the first FL framework that effectively provides user-level DP in the general cross-silo FL setting.
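One way to picture per-user weighted clipping across silos is the sketch below. It is an illustration under stated assumptions rather than the paper's algorithm: the function names are hypothetical, the weights are assumed to be proportional to each user's record count in each silo and to sum to 1 per user, and noise addition and the private weighting protocol are left out.

```python
import numpy as np

def record_count_weights(counts):
    """Turn record counts (n_silos, n_users) into weights that sum to 1 per user.

    Assumption: a user's weight in a silo is that silo's share of the user's records.
    """
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=0, keepdims=True)
    return np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)

def weighted_clipped_sum(updates, weights, clip_norm):
    """Clip each (silo, user) update to clip_norm, weight it, and sum everything.

    updates: (n_silos, n_users, dim); weights: (n_silos, n_users).
    Because each user's weights sum to at most 1, that user's total
    contribution across all silos has L2 norm at most clip_norm.
    """
    updates = np.asarray(updates, dtype=float)
    norms = np.linalg.norm(updates, axis=-1, keepdims=True)
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    return (weights[..., None] * scale * updates).sum(axis=(0, 1))
```

The point of the weighting in this sketch is that sensitivity is bounded per user directly, without multiplying a record-level bound by the number of records as a group-privacy argument would.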


Learning to Generate Image Embeddings with User-level Differential Privacy

arXiv.org Artificial Intelligence

Representation learning, by training deep neural networks as feature extractors to generate compact embedding vectors from images, is a fundamental component in computer vision. Metric learning, a kind of representation learning using supervised data, has been widely applied to image recognition, clustering, and retrieval [Schroff et al., 2015; Weinberger and Saul, 2009; Weyand et al., 2020]. Machine learning models have the capacity to memorize training data [Carlini et al., 2019, 2021], leading to privacy risks when the models are deployed. Privacy risk can also be audited by membership inference attacks [Carlini et al., 2022; Shokri et al., 2017], i.e., detecting whether certain data was used to train a model and potentially exposing users' usage behaviors. Defending against such risks is a critical responsibility when training on privacy-sensitive data. Differential Privacy (DP) [Dwork et al., 2006] is an extensively used quantifiable measurement of privacy risk, now generally accepted as a standard notion of privacy in both industry and government [Apple Privacy Team, 2017; Ding et al., 2017; McMahan and Thakurta, 2022; US Census Bureau, 2021]. Applied to machine learning, DP requires a training procedure with explicit randomness, and guarantees that the distribution over output models is quantifiably similar given a certain scope of change to the training dataset. A DP guarantee with respect to the change of a single arbitrary training example is known as example-level DP, which provides plausible deniability (in the binary hypothesis testing sense of [Kairouz et al., 2015]) that any single example (e.g., image) occurred ...

The first two authors contributed equally.
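For reference, the "quantifiably similar" requirement described above is usually written as the standard $(\varepsilon, \delta)$-DP condition. The notation below ($\mathcal{M}$ for the randomized training procedure, $D$ and $D'$ for neighboring datasets, $S$ for a set of output models) is introduced here for illustration and is not taken from the excerpt.

```latex
% Standard (epsilon, delta)-differential privacy condition.
% M: randomized training procedure; D, D': neighboring datasets;
% S: any measurable set of output models.
\[
  \Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[\mathcal{M}(D') \in S] + \delta
  \qquad \text{for all neighboring } D, D' \text{ and all } S .
\]
% Example-level DP: D and D' differ in a single training example.
% User-level DP:    D and D' differ in all examples of a single user.
```

The only difference between the example-level and user-level notions in this formulation is the neighboring relation, that is, the "scope of change" to the training dataset.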


User-Entity Differential Privacy in Learning Natural Language Models

arXiv.org Artificial Intelligence

In this paper, we introduce a novel concept of user-entity differential privacy (UeDP) that provides formal privacy protection simultaneously to both sensitive entities in textual data and data owners when learning natural language models (NLMs). To preserve UeDP, we develop a novel algorithm, called UeDP-Alg, that optimizes the trade-off between privacy loss and model utility with a tight sensitivity bound derived from seamlessly combining user and sensitive entity sampling processes. An extensive theoretical analysis and evaluation show that UeDP-Alg outperforms baseline approaches in model utility under the same privacy budget consumption on several NLM tasks, using benchmark datasets.