Privacy-Preserving Synthetic Review Generation with Diverse Writing Styles Using LLMs

Atwal, Tevin; Tieu, Chan Nam; Yuan, Yefeng; Shi, Zhan; Liu, Yuhong; Cheng, Liang

Jul-25-2025, arXiv.org Artificial Intelligence

The increasing use of synthetic data generated by Large Language Models (LLMs) presents both opportunities and challenges in data-driven applications. While synthetic data offers a cost-effective, scalable alternative to real-world data for model training, its diversity and privacy risks remain underexplored. Focusing on text-based synthetic data, we propose a comprehensive set of metrics to quantitatively assess the diversity (i.e., linguistic expression, sentiment, and user perspective) and privacy (i.e., re-identification risk and stylistic outliers) of synthetic datasets generated by several state-of-the-art LLMs. Experimental results reveal significant limitations in LLMs' ability to generate diverse and privacy-preserving synthetic data. Guided by these results, we propose a prompt-based approach that enhances the diversity of synthetic reviews while preserving reviewer privacy.
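
The abstract does not define its diversity metrics, so the following is only an illustrative sketch of one commonly used proxy for diversity of linguistic expression: the distinct-n ratio (unique n-grams divided by total n-grams). The function name `distinct_n`, the whitespace tokenization, and the sample reviews are assumptions made for this example and are not taken from the paper.

```python
def distinct_n(texts, n=2):
    """Return the ratio of unique n-grams to total n-grams over all texts.

    Illustrative proxy only; not the metric defined by the paper's authors.
    Tokenization is a naive lowercase whitespace split (an assumption).
    """
    total = 0
    unique = set()
    for text in texts:
        tokens = text.lower().split()
        # Collect every contiguous run of n tokens in this text.
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    # An empty corpus has no n-grams; report 0.0 rather than dividing by zero.
    return len(unique) / total if total else 0.0


# Hypothetical sample reviews, invented for this sketch.
repetitive = ["great product works well", "great product works well"]
varied = ["great product works well", "arrived late but sturdy build"]

print(distinct_n(repetitive))
print(distinct_n(varied))
```

A lower ratio indicates that the reviews reuse the same phrasing, while a ratio near 1 indicates little repeated wording. A measure like this covers only surface-level expression; sentiment and user-perspective diversity, as well as the privacy metrics named in the abstract, would need separate measures.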
