Better Private Distribution Testing by Leveraging Unverified Auxiliary Data

Aliakbarpour, Maryam, Burudgunte, Arnav, Canonne, Clément, Rubinfeld, Ronitt

arXiv.org Artificial Intelligence 

Accurately analyzing data while preserving individual privacy is a fundamental challenge in statistical inference. Since its formulation nearly two decades ago, Differential Privacy (DP) [DMNS06] has emerged as the leading framework for privacy-preserving data analysis, providing strong mathematical privacy guarantees and gaining adoption by major entities such as the U.S. Census Bureau, Amazon [Ama24], Google [EPK14], Microsoft [DKY17], and Apple [Dif17; TVVKFSD17]. Unfortunately, DP guarantees often come at the cost of increased data requirements or computational resources, which has limited its widespread adoption in spite of its theoretical appeal. To address this issue, a recent line of work has investigated whether access to even small amounts of additional public data could help mitigate this loss of performance. Promising results for various tasks have been shown, both experimentally [KST20; LLHR24; BZHZK24; DORKSF24] and theoretically [BKS22; BBCKS23]. The use of additional auxiliary information is very enticing, as such access is available in many real-world applications: for example, hospitals handling sensitive patient data might leverage public datasets, records from different periods or locations, or synthetic data generated by machine learning models to improve their analysis. Similarly, medical or socio-economic studies focusing on a minority or protected group can leverage statistical data from the overall population. However, integrating public data introduces its own challenges, as it often lacks guarantees regarding its accuracy or relevance to the private dataset.