differentially private synthetic data
Minimax optimal differentially private synthetic data for smooth queries
Ding, Rundong, He, Yiyun, Zhu, Yizhe
Differentially private synthetic data enables the sharing and analysis of sensitive datasets while providing rigorous privacy guarantees for individual contributors. A central challenge is to achieve strong utility guarantees for meaningful downstream analysis. Many existing methods ensure uniform accuracy over broad query classes, such as all Lipschitz functions, but this level of generality often leads to suboptimal rates for statistics of practical interest. Since many common data analysis queries exhibit smoothness beyond what worst-case Lipschitz bounds capture, we ask whether exploiting this additional structure can yield improved utility. We study the problem of generating $(\varepsilon,\delta)$-differentially private synthetic data from a dataset of size $n$ supported on the hypercube $[-1,1]^d$, with utility guarantees uniformly for all smooth queries having bounded derivatives up to order $k$. We propose a polynomial-time algorithm that achieves a minimax error rate of $n^{-\min\{1, \frac{k}{d}\}}$, up to a $\log(n)$ factor. This characterization uncovers a phase transition at $k=d$. Our results generalize the Chebyshev moment matching framework of (Musco et al., 2025; Wang et al., 2016) and strictly improve the error rates for $k$-smooth queries established in (Wang et al., 2016). Moreover, we establish the first minimax lower bound for the utility of $(\varepsilon,\delta)$-differentially private synthetic data with respect to $k$-smooth queries, extending the Wasserstein lower bound for $\varepsilon$-differential privacy in (Boedihardjo et al., 2024).
- North America > United States > Ohio > Wood County > Bowling Green (0.04)
- North America > United States > Ohio > Summit County > Green (0.04)
- North America > United States > California > San Diego County > San Diego (0.04)
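The moment-matching idea in the abstract above can be pictured in one dimension: a smooth query's empirical average is determined, up to a small truncation error, by low-degree Chebyshev moments of the data, so privatizing those few moments with the Gaussian mechanism answers every such query at once. A minimal sketch under stated assumptions (function names and parameter choices are illustrative, not the paper's algorithm):

```python
import numpy as np
from numpy.polynomial import chebyshev as C

rng = np.random.default_rng(0)

def private_cheb_moments(x, degree, eps, delta):
    """Gaussian-mechanism estimates of the empirical Chebyshev moments
    m_j = (1/n) sum_i T_j(x_i), j = 0..degree, for x in [-1, 1]."""
    n = len(x)
    # |T_j| <= 1 on [-1, 1], so changing one record moves each moment
    # by at most 2/n; L2 sensitivity of the whole moment vector:
    sens = 2.0 * np.sqrt(degree + 1) / n
    sigma = sens * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    moments = np.array([C.chebval(x, np.eye(degree + 1)[j]).mean()
                        for j in range(degree + 1)])
    return moments + rng.normal(0.0, sigma, size=degree + 1)

def answer_smooth_query(f, x, degree=8, eps=1.0, delta=1e-6):
    """Approximate (1/n) sum_i f(x_i) from private moments only: fit a
    Chebyshev interpolant c to f, then (1/n) sum f(x_i) ~= sum_j c_j m_j."""
    nodes = np.cos(np.pi * (np.arange(degree + 1) + 0.5) / (degree + 1))
    c = C.chebfit(nodes, f(nodes), degree)
    m = private_cheb_moments(x, degree, eps, delta)
    return float(c @ m)

x = rng.uniform(-1, 1, size=100_000)
f = lambda t: np.exp(t)                 # a smooth query
true = f(x).mean()
priv = answer_smooth_query(f, x)
print(abs(priv - true))                 # small for smooth f and large n
```

The noise added to the moments is what all smooth queries share, which is why one privatized vector serves the entire query class.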
Quantitative Auditing of AI Fairness with Differentially Private Synthetic Data
Yuan, Chih-Cheng Rex, Wang, Bow-Yaw
Fairness auditing of AI systems can identify and quantify biases. However, traditional auditing using real-world data raises security and privacy concerns. It exposes auditors to security risks as they become custodians of sensitive information and targets for cyberattacks. Privacy risks arise even without direct breaches, as data analyses can inadvertently expose confidential information. To address these, we propose a framework that leverages differentially private synthetic data to audit the fairness of AI systems. By applying privacy-preserving mechanisms, it generates synthetic data that mirrors the statistical properties of the original dataset while ensuring privacy. This method balances the goal of rigorous fairness auditing and the need for strong privacy protections. Through experiments on real datasets like Adult, COMPAS, and Diabetes, we compare fairness metrics of synthetic and real data. By analyzing the alignment and discrepancies between these metrics, we assess the capacity of synthetic data to preserve the fairness properties of real data. Our results demonstrate the framework's ability to enable meaningful fairness evaluations while safeguarding sensitive information, proving its applicability across critical and sensitive domains.
- Asia > Taiwan > Taiwan Province > Taipei (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > Florida > Broward County (0.04)
- Information Technology > Security & Privacy (1.00)
- Government > Military > Cyberwarfare (0.34)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Data Science > Data Mining > Big Data (0.34)
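The audit loop above amounts to computing the same fairness metric on real records and on a DP synthetic copy, then comparing. The sketch below uses demographic parity and a deliberately crude synthesizer (a Laplace-noised contingency table) as a stand-in for the paper's pipeline; all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

def demographic_parity_gap(y_pred, group):
    """|P(y=1 | group=0) - P(y=1 | group=1)| for binary arrays."""
    g0, g1 = y_pred[group == 0], y_pred[group == 1]
    return abs(g0.mean() - g1.mean())

def dp_synthetic_binary(group, y, eps=1.0):
    """Toy DP synthesizer: Laplace-noised 2x2 contingency table over
    (group, y) -- eps-DP for add/remove neighbours since the cells are
    disjoint counts -- then resample n records from it."""
    n = len(y)
    counts = np.array([[np.sum((group == a) & (y == b))
                        for b in (0, 1)] for a in (0, 1)], float)
    noisy = np.maximum(counts + rng.laplace(0, 1.0 / eps, counts.shape), 0)
    p = (noisy / noisy.sum()).ravel()
    cells = rng.choice(4, size=n, p=p)
    return cells // 2, cells % 2            # synthetic (group, y)

# toy "audited model": its positive rate differs across groups by 0.2
n = 50_000
group = rng.integers(0, 2, n)
y = (rng.random(n) < np.where(group == 1, 0.6, 0.4)).astype(int)

gap_real = demographic_parity_gap(y, group)
g_syn, y_syn = dp_synthetic_binary(group, y)
gap_syn = demographic_parity_gap(y_syn, g_syn)
print(gap_real, gap_syn)    # the synthetic gap tracks the real gap
```

Alignment between the two gaps is exactly the "capacity to preserve fairness properties" the abstract measures.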
A Consensus Privacy Metrics Framework for Synthetic Data
Pilgram, Lisa, Dankar, Fida K., Drechsler, Jorg, Elliot, Mark, Domingo-Ferrer, Josep, Francis, Paul, Kantarcioglu, Murat, Kong, Linglong, Malin, Bradley, Muralidhar, Krishnamurty, Myles, Puja, Prasser, Fabian, Raisaro, Jean Louis, Yan, Chao, Emam, Khaled El
Synthetic data generation is one approach for sharing individual-level data. However, to meet legislative requirements, it is necessary to demonstrate that the individuals' privacy is adequately protected. There is no consolidated standard for measuring privacy in synthetic data. Through an expert panel and consensus process, we developed a framework for evaluating privacy in synthetic data. Our findings indicate that current similarity metrics fail to measure identity disclosure, and their use is discouraged. For differentially private synthetic data, a privacy budget other than close to zero was not considered interpretable. There was consensus on the importance of membership and attribute disclosure, both of which involve inferring personal information about an individual without necessarily revealing their identity. The resultant framework provides precise recommendations for metrics that address these types of disclosures effectively. Our findings further present specific opportunities for future research that can help with widespread adoption of synthetic data.
- North America > Canada > Alberta (0.14)
- Europe > Netherlands (0.14)
- Europe > Germany > Berlin (0.14)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
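A toy version of the membership-disclosure idea the panel emphasized: flag a record as a training member when it lies unusually close to the synthetic data, and report the attacker's advantage (true-positive rate minus false-positive rate), which should be near zero for safe data. This is an illustrative sketch, not one of the framework's recommended metrics:

```python
import numpy as np

rng = np.random.default_rng(2)

def membership_advantage(train, holdout, synthetic):
    """Classify a record as a 'member' when its nearest synthetic
    neighbour is closer than the pooled median such distance.
    Advantage = TPR - FPR; near 0 suggests little disclosure."""
    def nn_dist(points, ref):
        return np.array([np.min(np.linalg.norm(ref - p, axis=1))
                         for p in points])
    d_train = nn_dist(train, synthetic)
    d_hold = nn_dist(holdout, synthetic)
    thr = np.median(np.concatenate([d_train, d_hold]))
    tpr = np.mean(d_train < thr)    # members correctly flagged
    fpr = np.mean(d_hold < thr)     # non-members wrongly flagged
    return tpr - fpr

# synthetic data drawn from the population vs. near-copies of training
train, holdout = rng.normal(size=(200, 3)), rng.normal(size=(200, 3))
good_syn = rng.normal(size=(500, 3))
leaky_syn = train + rng.normal(scale=1e-3, size=train.shape)

adv_good = membership_advantage(train, holdout, good_syn)
adv_leaky = membership_advantage(train, holdout, leaky_syn)
print(adv_good, adv_leaky)    # ~0 for safe data, ~1 for near-copies
```

Note the contrast with a bare similarity score: advantage is defined relative to a holdout set, so it targets what an attacker actually learns about membership.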
Differentially Private Synthetic Data via APIs 3: Using Simulators Instead of Foundation Models
Lin, Zinan, Baltrusaitis, Tadas, Yekhanin, Sergey
Differentially private (DP) synthetic data, which closely resembles the original private data while maintaining strong privacy guarantees, has become a key tool for unlocking the value of private data without compromising privacy. Recently, Private Evolution (PE) has emerged as a promising method for generating DP synthetic data. Unlike other training-based approaches, PE only requires access to inference APIs from foundation models, enabling it to harness the power of state-of-the-art models. However, a suitable foundation model for a specific private data domain is not always available. In this paper, we discover that the PE framework is sufficiently general to allow inference APIs beyond foundation models. Specifically, we show that simulators -- such as computer graphics-based image synthesis tools -- can also serve as effective APIs within the PE framework. This insight greatly expands the applicability of PE, enabling the use of a wide variety of domain-specific simulators for DP data synthesis. We explore the potential of this approach, named Sim-PE, in the context of image synthesis. Across three diverse simulators, Sim-PE performs well, improving the downstream classification accuracy of PE by up to 3x and reducing the FID score by up to 80%. We also show that simulators and foundation models can be easily leveraged together within the PE framework to achieve further improvements. The code is open-sourced in the Private Evolution Python library: https://github.com/microsoft/DPSDA.
- North America > United States > Washington > King County > Redmond (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- Europe > Netherlands > North Holland > Amsterdam (0.04)
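The PE loop the abstract builds on is simple to sketch: an API (here a stand-in simulator) proposes candidates, each private record votes for its nearest candidate, the vote histogram is privatized, and the population is resampled from the noisy votes. A hedged 1-D toy under assumed parameters, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(3)

def simulator_api(params, m):
    """Stand-in for a domain simulator: draws m samples around the
    current population (here, 1-D Gaussians at the given centres)."""
    return rng.normal(params[rng.integers(0, len(params), m)], 0.3)

def private_evolution(private_data, iters=10, m=200, sigma=2.0):
    """Minimal PE loop: candidates come only from API calls; the
    private data is touched solely through a noisy vote histogram."""
    pop = rng.uniform(-10, 10, m)              # random initial guesses
    for _ in range(iters):
        cand = simulator_api(pop, m)
        # each private record votes for its nearest candidate
        votes = np.bincount(
            np.abs(private_data[:, None] - cand[None, :]).argmin(1),
            minlength=m).astype(float)
        votes += rng.normal(0, sigma, m)       # Gaussian mechanism
        p = np.maximum(votes, 0)
        p = p / p.sum()
        pop = cand[rng.choice(m, size=m, p=p)]  # evolve the population
    return pop

private_data = rng.normal(5.0, 1.0, 2000)      # sensitive records
syn = private_evolution(private_data)
print(syn.mean(), syn.std())   # population drifts toward the private data
```

The key property Sim-PE exploits is visible here: nothing in the loop cares whether `simulator_api` wraps a foundation model or a graphics engine, only that it can propose and perturb candidates.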
Synthetic Data Aided Federated Learning Using Foundation Models
Abacha, Fatima, Teo, Sin G., Cordeiro, Lucas C., Mustafa, Mustafa A.
In heterogeneous scenarios where the data distribution amongst the Federated Learning (FL) participants is Non-Independent and Identically Distributed (Non-IID), FL suffers from the well-known problem of data heterogeneity. This significantly degrades FL performance, as the global model struggles to converge. To solve this problem, we propose Differentially Private Synthetic Data Aided Federated Learning Using Foundation Models (DPSDA-FL), a novel data augmentation strategy that helps homogenize the local data on the clients' side. DPSDA-FL improves the training of the local models by leveraging differentially private synthetic data generated from foundation models. We demonstrate the effectiveness of our approach by evaluating it on the benchmark image dataset CIFAR-10. Our experimental results show that DPSDA-FL can improve the class recall and classification accuracy of the global model by up to 26% and 9%, respectively, in FL settings with Non-IID issues.
- North America > United States > Virginia (0.04)
- North America > United States > Tennessee > Davidson County > Nashville (0.04)
- Europe > United Kingdom > England > Greater Manchester > Manchester (0.04)
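A minimal picture of the augmentation idea: top up each client's under-represented classes from a pool of (assumed) DP synthetic samples so that local label distributions become less skewed before training. Everything below is illustrative; DPSDA-FL itself augments with foundation-model images, not bare labels:

```python
import numpy as np

rng = np.random.default_rng(4)

NUM_CLASSES = 4

def label_skew(counts):
    """Std-dev of a client's class proportions; 0 means balanced."""
    p = counts / counts.sum()
    return p.std()

def augment(client_labels, synthetic_labels, per_class=50):
    """Top up each under-represented class with synthetic samples
    until every class has at least `per_class` examples."""
    counts = np.bincount(client_labels, minlength=NUM_CLASSES)
    extra = []
    for c in range(NUM_CLASSES):
        need = max(per_class - counts[c], 0)
        pool = synthetic_labels[synthetic_labels == c]
        extra.append(pool[:need])
    return np.concatenate([client_labels] + extra)

# a Non-IID client that mostly holds classes 0 and 1
client = rng.choice(NUM_CLASSES, 200, p=[0.45, 0.45, 0.05, 0.05])
# pool assumed to come from a DP generator (balanced here)
syn_pool = rng.permutation(np.repeat(np.arange(NUM_CLASSES), 100))

before = label_skew(np.bincount(client, minlength=NUM_CLASSES))
after = label_skew(np.bincount(augment(client, syn_pool),
                               minlength=NUM_CLASSES))
print(before, after)    # skew drops after augmentation
```

Because the synthetic pool is privacy-protected, it can be shared with every client without leaking any participant's raw records.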
Adapting Differentially Private Synthetic Data to Relational Databases
Alimohammadi, Kaveh, Wang, Hao, Gulati, Ojas, Srivastava, Akash, Azizan, Navid
Relational databases play a pivotal role in modern information systems and business operations due to their efficiency in managing structured data [39]. According to a Kaggle survey [23], 65.5% of users worked extensively with relational data. Additionally, the majority of leading database management systems (e.g., MySQL and Oracle) are built on relational database principles [35]. These systems organize data into multiple tables, each representing a specific entity, and the relationships between tables delineate the connections between these entities. However, the widespread use of relational databases also carries a significant risk of privacy leakage.
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > Italy > Sicily (0.04)
Differentially Private Synthetic Data with Private Density Estimation
Bojkovic, Nikolija, Loh, Po-Ling
The need to analyze sensitive data, such as medical records or financial data, has created a critical research challenge in recent years. In this paper, we adopt the framework of differential privacy, and explore mechanisms for generating an entire dataset that accurately captures characteristics of the original data. We build upon the work of Boedihardjo et al., which laid the foundations for a new optimization-based algorithm for generating private synthetic data. Importantly, we adapt their algorithm by replacing a uniform sampling step with a private distribution estimator; this allows us to obtain better computational guarantees for discrete distributions, and develop a novel algorithm suitable for continuous distributions. We also explore applications of our work to several statistical tasks.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.28)
- Europe > Serbia > Central Serbia > Belgrade (0.04)
- Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
- Information Technology > Security & Privacy (1.00)
- Health & Medicine (0.87)
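The substitution the abstract describes, a private density estimator in place of uniform sampling, can be sketched for a 1-D continuous distribution with a Laplace-noised histogram. This is a toy stand-in for the idea, not the authors' algorithm:

```python
import numpy as np

rng = np.random.default_rng(5)

def private_histogram_sampler(data, bins=20, eps=1.0, size=1000):
    """Private density estimate + sampling: Laplace-noised histogram
    over [0, 1] (eps-DP for add/remove neighbours, since the bins are
    disjoint counts), then draw points uniformly within sampled bins."""
    counts, edges = np.histogram(data, bins=bins, range=(0.0, 1.0))
    noisy = np.maximum(counts + rng.laplace(0, 1.0 / eps, bins), 0)
    p = noisy / noisy.sum()
    which = rng.choice(bins, size=size, p=p)
    return rng.uniform(edges[which], edges[which + 1])

data = rng.beta(2, 5, 10_000)          # sensitive records on [0, 1]
syn = private_histogram_sampler(data)
print(data.mean(), syn.mean())         # the two means should be close
```

Sampling from an estimated density rather than uniformly concentrates the synthetic points where the data actually lives, which is the computational gain the abstract refers to.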
Differentially Private Synthetic Data via Foundation Model APIs 2: Text
Xie, Chulin, Lin, Zinan, Backurs, Arturs, Gopi, Sivakanth, Yu, Da, Inan, Huseyin A, Nori, Harsha, Jiang, Haotian, Zhang, Huishuai, Lee, Yin Tat, Li, Bo, Yekhanin, Sergey
Text data has become extremely valuable due to the emergence of machine learning algorithms that learn from it. A lot of high-quality text data generated in the real world is private and therefore cannot be shared or used freely due to privacy concerns. Generating synthetic replicas of private text data with a formal privacy guarantee, i.e., differential privacy (DP), offers a promising and scalable solution. However, existing methods necessitate DP finetuning of large language models (LLMs) on private data to generate DP synthetic data. This approach is not viable for proprietary LLMs (e.g., GPT-3.5) and also demands considerable computational resources for open-source LLMs. Lin et al. (2024) recently introduced the Private Evolution (PE) algorithm to generate DP synthetic images with only API access to diffusion models. In this work, we propose an augmented PE algorithm, named Aug-PE, that applies to the complex setting of text. We use API access to an LLM and generate DP synthetic text without any model training. We conduct comprehensive experiments on three benchmark datasets. Our results demonstrate that Aug-PE produces DP synthetic text that yields competitive utility with the SOTA DP finetuning baselines. This underscores the feasibility of relying solely on API access of LLMs to produce high-quality DP synthetic texts, thereby facilitating more accessible routes to privacy-preserving LLM applications. Our code and data are available at https://github.com/AI-secure/aug-pe.
- Information Technology > Security & Privacy (1.00)
- Health & Medicine (1.00)
Differentially Private Synthetic Data: Applied Evaluations and Enhancements
Rosenblatt, Lucas, Liu, Xiaoyan, Pouyanfar, Samira, de Leon, Eduardo, Desai, Anuj, Allen, Joshua
Machine learning practitioners frequently seek to leverage the most informative available data, without violating the data owner's privacy, when building predictive models. Differentially private data synthesis protects personal details from exposure, and allows for the training of differentially private machine learning models on privately generated datasets. But how can we effectively assess the efficacy of differentially private synthetic data? In this paper, we survey four differentially private generative adversarial networks for data synthesis. We evaluate each of them at scale on five standard tabular datasets, and in two applied industry scenarios. Our results suggest some synthesizers are more applicable for different privacy budgets, and we further demonstrate complicating domain-based tradeoffs in selecting an approach. We share experimental lessons from applied machine learning scenarios with private internal data for researchers and practitioners alike. In addition, we propose QUAIL, an ensemble-based modeling approach to generating synthetic data. We examine QUAIL's tradeoffs, and note circumstances in which it outperforms baseline differentially private supervised learning models under the same budget constraint. Maintaining an individual's privacy is a major concern when collecting sensitive information from groups or organizations. A formalization of privacy, known as differential privacy, has become the gold standard with which to protect information from malicious agents (Dwork et al., TAMC 2008).
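A common way to make the abstract's question ("how can we assess efficacy?") concrete is train-on-synthetic, test-on-real (TSTR) compared against train-on-real, test-on-real (TRTR). The sketch below uses a toy nearest-centroid classifier and a stand-in synthesizer whose "DP-style" noise is not calibrated to a real budget; all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(6)

def nearest_centroid_fit(X, y):
    """Per-class mean vectors for a binary-labelled dataset."""
    return np.stack([X[y == c].mean(0) for c in (0, 1)])

def accuracy(centroids, X, y):
    pred = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(1)
    return (pred == y).mean()

def make_data(n):
    y = rng.integers(0, 2, n)
    X = rng.normal(loc=np.where(y[:, None] == 1, 1.0, -1.0),
                   scale=1.0, size=(n, 2))
    return X, y

def dp_synthesize(X, y, noise=0.05, n=4000):
    """Stand-in synthesizer: per-class Gaussian with noised means
    (loosely 'DP-style'; not a calibrated mechanism)."""
    y_s = rng.integers(0, 2, n)
    mus = nearest_centroid_fit(X, y) + rng.normal(0, noise, (2, 2))
    return rng.normal(mus[y_s], 1.0), y_s

X_real, y_real = make_data(4000)
X_test, y_test = make_data(2000)
X_syn, y_syn = dp_synthesize(X_real, y_real)

trtr = accuracy(nearest_centroid_fit(X_real, y_real), X_test, y_test)
tstr = accuracy(nearest_centroid_fit(X_syn, y_syn), X_test, y_test)
print(trtr, tstr)    # TSTR close to TRTR => useful synthetic data
```

A small TRTR-TSTR gap is the operational meaning of "efficacy" in evaluations like the ones this paper runs at scale.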
Really Useful Synthetic Data -- A Framework to Evaluate the Quality of Differentially Private Synthetic Data
Arnold, Christian, Neunhoeffer, Marcel
Recent advances in generating synthetic data with principled privacy protections -- such as Differential Privacy -- are a crucial step in sharing statistical information in a privacy-preserving way. But while the focus has been on privacy guarantees, the resulting private synthetic data is only useful if it still carries statistical information from the original data. To further optimise the inherent trade-off between data privacy and data quality, it is necessary to think closely about the latter. What is it that data analysts want? Acknowledging that data quality is a subjective concept, we develop a framework to evaluate the quality of differentially private synthetic data from an applied researcher's perspective. Data quality can be measured along two dimensions. First, the quality of synthetic data can be evaluated against training data or against an underlying population. Second, quality can target general similarity of distributions or specific tasks such as inference or prediction. It is clear that accommodating all goals at once is a formidable challenge. We invite the academic community to jointly advance the privacy-quality frontier.
- Europe > Austria > Vienna (0.14)
- North America > United States > California > Santa Barbara County > Santa Barbara (0.04)
- Europe > United Kingdom (0.04)
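The second dimension above can genuinely disagree with itself: one synthetic dataset can win on general distributional similarity while another wins on a specific inferential target. A small illustration with two hypothetical 1-D "synthesizers", using a hand-rolled two-sample Kolmogorov-Smirnov distance:

```python
import numpy as np

rng = np.random.default_rng(7)

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov distance (general similarity)."""
    allv = np.sort(np.concatenate([a, b]))
    cdf = lambda s: np.searchsorted(np.sort(s), allv, side="right") / len(s)
    return np.max(np.abs(cdf(a) - cdf(b)))

def mean_gap(a, b):
    """Task-specific quality: error in one inferential target."""
    return abs(a.mean() - b.mean())

n = 20_000
real = rng.normal(0.0, 1.0, n)
syn_shifted = rng.normal(0.25, 1.0, n)   # right shape, biased mean
syn_scaled = rng.normal(0.0, 2.0, n)     # right mean, wrong spread

ks_shift, ks_scale = ks_statistic(real, syn_shifted), ks_statistic(real, syn_scaled)
gap_shift, gap_scale = mean_gap(real, syn_shifted), mean_gap(real, syn_scaled)
print(ks_shift, ks_scale)    # shifted data looks *more* similar overall
print(gap_shift, gap_scale)  # scaled data answers the mean query better
```

The shifted synthesizer ranks better on the distribution-level metric, the scaled one on the task-level metric, which is exactly why the framework insists on stating the analyst's goal before scoring quality.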