Privacy profile


Privacy Amplification by Subsampling: Tight Analyses via Couplings and Divergences

Borja Balle, Gilles Barthe, Marco Gaboardi

Neural Information Processing Systems

Differential privacy comes equipped with multiple analytical tools for the design of private data analyses. One important tool is the so-called "privacy amplification by subsampling" principle, which ensures that a differentially private mechanism run on a random subsample of a population provides higher privacy guarantees than when run on the entire population. Several instances of this principle have been studied for different random subsampling methods, each with an ad-hoc analysis. In this paper we present a general method that recovers and improves prior analyses, yields lower bounds and derives new instances of privacy amplification by subsampling. Our method leverages a characterization of differential privacy as a divergence which emerged in the program verification community. Furthermore, it introduces new tools, including advanced joint convexity and privacy profiles, which might be of independent interest.
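A concrete instance of this principle (a minimal sketch, not the paper's general coupling-based method): for Poisson subsampling with rate q, a well-known bound states that an (ε, δ)-DP mechanism run on the subsample satisfies (ε′, qδ)-DP with ε′ = log(1 + q(e^ε − 1)).

```python
import math

def amplified_epsilon(eps: float, q: float) -> float:
    """Privacy amplification by Poisson subsampling: an (eps, delta)-DP
    mechanism run on a q-subsample satisfies (eps', q * delta)-DP with
    eps' = log(1 + q * (exp(eps) - 1))."""
    return math.log(1.0 + q * (math.exp(eps) - 1.0))

# Subsampling strictly improves the guarantee for q < 1:
print(amplified_epsilon(1.0, 0.01))  # roughly 0.017, far below the original eps = 1
```

Note that for q = 1 (no subsampling) the bound recovers the original ε, as expected.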



Controlling What You Share: Assessing Language Model Adherence to Privacy Preferences

Guillem Ramírez, Alexandra Birch, Ivan Titov

arXiv.org Artificial Intelligence

Large language models (LLMs) are primarily accessed via commercial APIs, but this often requires users to expose their data to service providers. In this paper, we explore how users can stay in control of their data by using privacy profiles: simple natural language instructions that say what should and should not be revealed. We build a framework where a local model uses these instructions to rewrite queries, only hiding details deemed sensitive by the user, before sending them to an external model, thus balancing privacy with performance. To support this research, we introduce PEEP, a multilingual dataset of real user queries annotated to mark private content and paired with synthetic privacy profiles. Experiments with lightweight local LLMs show that, after fine-tuning, they not only achieve markedly better privacy preservation but also match or exceed the performance of much larger zero-shot models. At the same time, the system still faces challenges in fully adhering to user instructions, underscoring the need for models with a better understanding of user-defined privacy preferences.
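The rewriting step can be sketched as follows. This is a toy stand-in, not the paper's system: the regex-to-placeholder dictionary below is a hypothetical substitute for the natural-language privacy profiles and the fine-tuned local LLM that PEEP actually uses.

```python
import re

def apply_privacy_profile(query: str, profile: dict) -> str:
    """Toy stand-in for the local rewriting model: replace each span the
    user's profile marks as private with a neutral placeholder before the
    query leaves the device for the external model."""
    rewritten = query
    for pattern, placeholder in profile.items():
        rewritten = re.sub(pattern, placeholder, rewritten)
    return rewritten

# Hypothetical profile: hide ID numbers and the user's employer.
profile = {r"\b\d{3}-\d{2}-\d{4}\b": "[ID]",
           r"\bAcme Corp\b": "[EMPLOYER]"}
query = "I work at Acme Corp and my SSN is 123-45-6789, what taxes do I owe?"
print(apply_privacy_profile(query, profile))
```

The design point the paper studies is exactly this trade-off: hiding only what the profile deems sensitive, so the external model still receives enough context to answer well.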



$(\varepsilon, \delta)$ Considered Harmful: Best Practices for Reporting Differential Privacy Guarantees

Juan Felipe Gomez, Bogdan Kulynych, Georgios Kaissis, Jamie Hayes, Borja Balle, Antti Honkela

arXiv.org Machine Learning

Differential privacy (DP) (Dwork et al., 2006; Dwork & Roth, 2014) has emerged as the gold standard for privacy-preserving machine learning with provable privacy guarantees. The past two decades have seen significant progress in understanding the precise privacy properties of different algorithms, as well as the emergence of many new privacy formalisms (Desfontaines & Pejó, 2020). Despite this multitude of formalisms, the standard way of reporting privacy guarantees has been to use (ε, δ)-DP (Dwork & Roth, 2014) with a fixed and small δ. The parameter δ is commonly suggested to be significantly smaller than 1/N for a dataset of N individuals, e.g., cryptographically small (Vadhan, 2017; Ponomareva et al., 2023); however, exact values vary in the literature, and δ is ultimately an arbitrary parameter that practitioners must choose ad hoc. This arbitrariness leads to downstream problems, the most important of which is that the privacy budget ε is incomparable across algorithms (Kaissis et al., 2024). Additionally, (ε, δ)-DP with a single δ is a poor representation of the actual privacy guarantees of most practical machine learning algorithms, which leads to severe overestimation of risk when converting it to interpretable bounds on the success rates of attacks aiming to infer private information in the training data (Kulynych et al., 2024), as illustrated in Figure 1. In this paper, we make the empirical observation that various practical deployments of DP machine learning algorithms, when analysed with modern numerical algorithms known as accountants (Koskela & Honkela, 2021; Gopi et al., 2021; Alghamdi et al., 2023; Doroshenko et al., 2022), are almost exactly characterized by a notion of privacy known as Gaussian DP (GDP) (Dong et al., 2022). In particular, we observe this behavior for DP large-scale image classification (De et al., 2022) and the TopDown algorithm for the U.S. Decennial Census (Abowd et al., 2022).
This observation is also consistent with the fact that the privacy of the widely used Gaussian mechanism (Dwork & Roth, 2014) is perfectly captured by GDP, and according to the Central Limit Theorem of DP (Dong et al., 2022), the privacy guarantees of a composed algorithm, i.e., one that consists of many applications of simpler building-block DP algorithms, approach those of the Gaussian mechanism.
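The connection between GDP and (ε, δ)-DP referred to above is explicit: a μ-GDP guarantee implies the full curve δ(ε) = Φ(−ε/μ + μ/2) − e^ε Φ(−ε/μ − μ/2), where Φ is the standard normal CDF (Dong et al., 2022). A minimal sketch of that conversion:

```python
import math

def normal_cdf(x: float) -> float:
    """Standard normal CDF via the error function (no SciPy needed)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gdp_to_delta(mu: float, eps: float) -> float:
    """(eps, delta)-DP curve implied by a mu-GDP guarantee (Dong et al., 2022):
    delta(eps) = Phi(-eps/mu + mu/2) - e^eps * Phi(-eps/mu - mu/2)."""
    return normal_cdf(-eps / mu + mu / 2.0) - math.exp(eps) * normal_cdf(-eps / mu - mu / 2.0)

# A single mu describes the whole (eps, delta) trade-off curve,
# rather than one arbitrarily chosen (eps, delta) point:
for eps in (0.5, 1.0, 2.0):
    print(eps, gdp_to_delta(1.0, eps))
```

This is why reporting the GDP parameter (or the full curve) avoids the arbitrariness of fixing a single δ.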


Privacy amplification by random allocation

Vitaly Feldman, Moshe Shenfeld

arXiv.org Artificial Intelligence

We consider the privacy guarantees of an algorithm in which a user's data is used in $k$ steps chosen randomly and uniformly from a sequence (or set) of $t$ differentially private steps. We demonstrate that the privacy guarantees of this sampling scheme can be upper bounded by the privacy guarantees of the well-studied independent (or Poisson) subsampling, in which each step uses the user's data with probability $(1+ o(1))k/t$. Further, we provide two additional analysis techniques that lead to numerical improvements in some parameter regimes. The case of $k=1$ has been previously studied in the context of DP-SGD in Balle et al. (2020) and very recently in Chua et al. (2024a). The privacy analysis of Balle et al. (2020) relies on privacy amplification by shuffling, which leads to overly conservative bounds. The privacy analysis of Chua et al. (2024a) relies on Monte Carlo simulations that are computationally prohibitive in many practical scenarios and have additional inherent limitations.
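The two sampling schemes being compared can be stated in a few lines (an illustration of the schemes only, not of the paper's privacy analysis):

```python
import random

def allocate_uniform(k: int, t: int) -> list:
    """Random allocation: the user's data participates in exactly k of the
    t steps, chosen uniformly without replacement."""
    return sorted(random.sample(range(t), k))

def allocate_poisson(k: int, t: int) -> list:
    """Poisson subsampling: each step independently includes the user's data
    with probability k/t, so the participation count is Binomial(t, k/t)."""
    return [s for s in range(t) if random.random() < k / t]

random.seed(0)
# Both schemes give the same marginal participation rate k/t per step; the
# paper shows the allocation scheme's guarantees are upper bounded by those
# of Poisson subsampling at rate (1 + o(1)) k/t.
counts = [len(allocate_poisson(4, 100)) for _ in range(10000)]
print(sum(counts) / len(counts))  # close to k = 4, but with random fluctuation
```

The difference is that allocation fixes the participation count at exactly k, while Poisson subsampling only matches it in expectation.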


Privacy Amplification by Structured Subsampling for Deep Differentially Private Time Series Forecasting

Jan Schuchardt, Mina Dalirrooyfard, Jed Guzelkabaagac, Anderson Schneider, Yuriy Nevmyvaka, Stephan Günnemann

arXiv.org Machine Learning

Many forms of sensitive data, such as web traffic, mobility data, or hospital occupancy, are inherently sequential. The standard method for training machine learning models while ensuring privacy for units of sensitive information, such as individual hospital visits, is differentially private stochastic gradient descent (DP-SGD). However, we observe in this work that the formal guarantees of DP-SGD are incompatible with time-series-specific tasks like forecasting, since they rely on the privacy amplification attained by training on small, unstructured batches sampled from an unstructured dataset. In contrast, batches for forecasting are generated by (1) sampling sequentially structured time series from a dataset, (2) sampling contiguous subsequences from these series, and (3) partitioning them into context and ground-truth forecast windows. We theoretically analyze the privacy amplification attained by this structured subsampling to enable the training of forecasting models with sound and tight event- and user-level privacy guarantees. Towards more private models, we additionally prove how data augmentation amplifies privacy in self-supervised training of sequence models. Our empirical evaluation demonstrates that amplification by structured subsampling enables the training of forecasting models with strong formal privacy guarantees.
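The three-stage batch construction described in the abstract can be sketched directly (a minimal illustration with toy data, not the paper's implementation):

```python
import random

def sample_forecasting_batch(dataset, batch_size, context_len, forecast_len):
    """Batch construction for forecasting: (1) sample a time series,
    (2) sample a contiguous subsequence, (3) split it into a context
    window and a ground-truth forecast window."""
    window = context_len + forecast_len
    batch = []
    for _ in range(batch_size):
        series = random.choice(dataset)                      # (1) sample a series
        start = random.randrange(len(series) - window + 1)   # (2) contiguous subsequence
        sub = series[start:start + window]
        batch.append((sub[:context_len], sub[context_len:])) # (3) context / forecast split
    return batch

random.seed(1)
data = [list(range(i, i + 50)) for i in range(3)]  # three toy series of length 50
batch = sample_forecasting_batch(data, batch_size=2, context_len=8, forecast_len=4)
```

Unlike Poisson subsampling of independent records, every element of such a batch is correlated with its neighbours in time, which is exactly why the standard DP-SGD amplification argument does not apply.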


Laplace Transform Interpretation of Differential Privacy

Rishav Chourasia, Uzair Javaid, Biplab Sikdar

arXiv.org Artificial Intelligence

Differential privacy (DP) [13] has become a widely adopted standard for quantifying the privacy of algorithms that process statistical data. In simple terms, differential privacy bounds the influence a single data point may have on the outcome probabilities. Being a statistical property, the design of differentially private algorithms involves a pen-and-paper analysis of any randomness internal to the processing that obscures the influence a data point might have on its output. A clear understanding of the nature of differential privacy notions is therefore central to the study and design of privacy-preserving algorithms. Over the years, various functional interpretations of the concept of differential privacy have emerged. These include the privacy-profile curve δ(ε) [5] that traces the (ε, δ)-DP point guarantees, the f-DP [11] view of the worst-case trade-off curve between type I and type II errors in membership hypothesis testing [19, 6], the Rényi DP [23] function of order q that admits a natural analytical composition [1, 23], the privacy loss distribution (PLD) view [29] that allows for approximate numerical composition [20, 18], and the recent characteristic-function formulation of the dominating privacy loss random variables by Zhu et al. [32]. Each of these formalisms has its own properties and use cases, and none of them seems to be superior in all aspects. Regardless of their differences, they all share some difficulties: certain types of manipulations on them are harder to perform in the time domain but considerably simpler in the frequency domain. For instance, Koskela et al. [20] noted that composing the PLDs of two mechanisms involves convolving their probability densities, which can be numerically approximated efficiently.
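The convolution fact mentioned at the end is easy to see on a discretized grid. The sketch below composes two toy PLDs represented as probability vectors on a shared uniform loss grid; real accountants additionally handle grid alignment, truncation, and error bounds, all of which are omitted here.

```python
import numpy as np

def compose_plds(pld_a, pld_b):
    """Composing two mechanisms adds their privacy loss random variables,
    so the PLD of the composition is the convolution of the individual
    PLDs (Koskela et al.). Inputs are probability vectors on a shared
    uniform grid of privacy-loss values; the output lives on the grid
    of pairwise sums."""
    return np.convolve(pld_a, pld_b)

# A toy PLD on a three-point grid; self-composition is its self-convolution.
pld = np.array([0.2, 0.5, 0.3])
composed = compose_plds(pld, pld)
print(composed.sum())  # still a probability distribution: sums to 1
```

Convolution in the time domain is multiplication in the frequency domain, which is precisely the manipulation the Laplace transform view makes simple.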


The 2020 United States Decennial Census Is More Private Than You (Might) Think

Buxin Su, Weijie J. Su, Chendi Wang

arXiv.org Machine Learning

The U.S. Decennial Census serves as the foundation for many high-profile policy decision-making processes, including federal funding allocation and redistricting. In 2020, the Census Bureau adopted differential privacy to protect the confidentiality of individual responses through a disclosure avoidance system that injects noise into census data tabulations. The Bureau subsequently posed an open question: Could sharper privacy guarantees be obtained for the 2020 U.S. Census compared to their published guarantees, or equivalently, had the nominal privacy budgets been fully utilized? In this paper, we affirmatively address this open problem by demonstrating that between 8.50% and 13.76% of the privacy budget for the 2020 U.S. Census remains unused for each of the eight geographical levels, from the national level down to the block level. This finding is made possible through our precise tracking of privacy losses using $f$-differential privacy, applied to the composition of private queries across various geographical levels. Our analysis indicates that the Census Bureau introduced unnecessarily high levels of injected noise to achieve the claimed privacy guarantee for the 2020 U.S. Census. Consequently, our results enable the Bureau to reduce noise variances by 15.08% to 24.82% while maintaining the same privacy budget for each geographical level, thereby enhancing the accuracy of privatized census statistics. We empirically demonstrate that reducing noise injection into census statistics mitigates distortion caused by privacy constraints in downstream applications of private census data, illustrated through a study examining the relationship between earnings and education.
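The f-DP machinery the paper relies on has a closed form in the Gaussian case: a μ-GDP mechanism has trade-off curve f(α) = Φ(Φ⁻¹(1 − α) − μ), and composing Gaussian mechanisms stays Gaussian with the parameters combining as a Euclidean norm. A minimal sketch with made-up per-level parameters (the actual Census allocations are not reproduced here):

```python
import math
from statistics import NormalDist

N = NormalDist()

def gaussian_tradeoff(alpha: float, mu: float) -> float:
    """mu-GDP trade-off curve: the smallest type II error achievable at
    type I error alpha when distinguishing neighbouring datasets."""
    return N.cdf(N.inv_cdf(1.0 - alpha) - mu)

def compose_gaussian(mus) -> float:
    """Composition of Gaussian mechanisms is Gaussian: the overall
    parameter is the Euclidean norm of the per-query parameters."""
    return math.sqrt(sum(m * m for m in mus))

# Hypothetical per-level parameters, one per geographical level:
mu_total = compose_gaussian([0.3] * 8)
print(mu_total, gaussian_tradeoff(0.05, mu_total))
```

Tracking the full trade-off curve through composition like this, rather than a single (ε, δ) point, is what allows losses to be accounted tightly enough to reveal unused budget.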


Beyond the Calibration Point: Mechanism Comparison in Differential Privacy

Georgios Kaissis, Stefan Kolek, Borja Balle, Jamie Hayes, Daniel Rueckert

arXiv.org Machine Learning

In differentially private (DP) machine learning, the privacy guarantees of DP mechanisms are often reported and compared on the basis of a single $(\varepsilon, \delta)$-pair. This practice overlooks that DP guarantees can vary substantially even between mechanisms sharing a given $(\varepsilon, \delta)$, and potentially introduces privacy vulnerabilities which can remain undetected. This motivates the need for robust, rigorous methods for comparing DP guarantees in such cases. Here, we introduce the $\Delta$-divergence between mechanisms which quantifies the worst-case excess privacy vulnerability of choosing one mechanism over another in terms of $(\varepsilon, \delta)$, $f$-DP and in terms of a newly presented Bayesian interpretation. Moreover, as a generalisation of the Blackwell theorem, it is endowed with strong decision-theoretic foundations. Through application examples, we show that our techniques can facilitate informed decision-making and reveal gaps in the current understanding of privacy risks, as current practices in DP-SGD often result in choosing mechanisms with high excess privacy vulnerabilities.
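The motivating observation can be illustrated by scanning whole δ(ε) curves instead of one point. The sketch below compares the privacy profile of a μ-GDP mechanism with that of ε₀-randomized response; the worst pointwise gap computed here is only a crude illustration, not the paper's Δ-divergence, which has a decision-theoretic definition.

```python
import math
from statistics import NormalDist

Phi = NormalDist().cdf

def delta_gaussian(eps: float, mu: float) -> float:
    """(eps, delta) privacy profile implied by mu-GDP (Dong et al., 2022)."""
    return Phi(-eps / mu + mu / 2.0) - math.exp(eps) * Phi(-eps / mu - mu / 2.0)

def delta_rr(eps: float, eps0: float) -> float:
    """Tight privacy profile of eps0-randomized response:
    delta(eps) = (e^eps0 - e^eps) / (1 + e^eps0) for eps <= eps0, else 0."""
    return max(0.0, (math.exp(eps0) - math.exp(eps)) / (1.0 + math.exp(eps0)))

# Two mechanisms can look similar at one (eps, delta) point yet have very
# different profiles elsewhere; scan a grid to see the worst pointwise gap.
grid = [i / 100.0 for i in range(0, 301)]
gap = max(abs(delta_gaussian(e, 1.0) - delta_rr(e, 3.0)) for e in grid)
print(gap)
```

Any single calibration point hides this gap, which is exactly the vulnerability the paper's comparison method is designed to expose.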