Optimizing the Privacy-Utility Balance using Synthetic Data and Configurable Perturbation Pipelines

Sharma, Anantha; Devabhaktuni, Swetha; Mohan, Eklove

arXiv.org Artificial Intelligence 

The Banking, Financial Services, and Insurance (BFSI) sector operates on vast volumes of highly sensitive customer data, creating an enduring tension between the drive for data-driven insights and the imperative to comply with strict privacy and security regulations such as GDPR [1] and CCPA [2]. Traditional anonymization methods such as masking, aggregation, k-anonymity, l-diversity, and t-closeness often degrade data quality to the point where sophisticated analytics, fraud detection, risk modeling, and machine learning applications suffer significant performance drops. These legacy approaches can also remain vulnerable to linkage and inference attacks, undermining both privacy guarantees and competitive innovation in financial institutions. Techniques that create privacy-preserving datasets without sacrificing analytical utility are therefore needed, and two broad categories have emerged in response: purely synthetic data generation and advanced data perturbation. Purely synthetic data, often created with deep generative models such as GANs, aims to capture the statistical patterns of real data without any one-to-one mapping to real individuals. Advanced data perturbation applies carefully calibrated noise, transformations, and privacy-enhancing techniques such as differential privacy to the original datasets, seeking to obscure sensitive information while retaining analytical value. It can include context-aware transformations, in which the nature of the data and its intended use inform the perturbation strategy so that the resulting dataset remains useful for specific tasks. However, balancing privacy and utility effectively remains a challenge, since traditional methods often either fail to provide sufficient privacy guarantees or produce datasets that are too noisy for practical use.
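
To make the idea of "carefully calibrated noise" concrete, the following is a minimal sketch of the standard Laplace mechanism applied to a bounded mean. It is an illustration only and is not taken from the authors' pipeline. The function names (`laplace_noise`, `dp_mean`), the clipping bounds, and the choice of the mean as the released statistic are all assumptions made for this example, and the record count is assumed to be public.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from Laplace(0, scale).

    The difference of two independent exponential variables with mean
    `scale` follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_mean(values, lower, upper, epsilon, rng=None):
    """Release an epsilon-differentially-private mean (illustrative sketch).

    Assumptions (not from the source): each record is clipped to
    [lower, upper], the number of records is public, and neighbouring
    datasets differ by replacing one record.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if not values:
        raise ValueError("values must be non-empty")
    rng = rng or random.Random()

    # Clipping bounds each record's influence on the statistic.
    clipped = [min(max(v, lower), upper) for v in values]
    n = len(clipped)

    # Replacing one record changes the mean by at most (upper - lower) / n.
    sensitivity = (upper - lower) / n

    # Noise scale is calibrated to sensitivity / epsilon:
    # a smaller epsilon means more noise and stronger privacy.
    scale = sensitivity / epsilon
    return sum(clipped) / n + laplace_noise(scale, rng)


if __name__ == "__main__":
    # Hypothetical transaction amounts, for demonstration only.
    amounts = [120.0, 75.5, 310.0, 42.0, 980.0, 15.25]
    for eps in (0.1, 1.0, 10.0):
        print(eps, dp_mean(amounts, lower=0.0, upper=500.0, epsilon=eps))
```

Running the demonstration for several values of `epsilon` shows the trade-off the abstract describes: small values give stronger privacy but a noisier mean, while large values give an estimate close to the clipped mean at the cost of a weaker guarantee.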