Privacy Risk Predictions Based on Fundamental Understanding of Personal Data and an Evolving Threat Landscape

Haoran Niu, K. Suzanne Barber

arXiv.org Artificial Intelligence 

It is difficult for individuals and organizations to protect personal information without a fundamental understanding of relative privacy risks. By analyzing over 5,000 empirical identity theft and fraud cases, this research identifies which types of personal data are exposed, how frequently exposures occur, and what consequences follow. We construct an Identity Ecosystem graph: a foundational, graph-based model in which nodes represent personally identifiable information (PII) attributes and edges represent empirical disclosure relationships between them (e.g., the probability that one PII attribute is exposed as a result of the exposure of another). Leveraging this graph structure, we develop a privacy risk prediction framework that applies graph theory and graph neural networks to estimate the likelihood of further disclosures when particular PII attributes are compromised (both steps are sketched below). The results show that our approach effectively answers the core question: can the disclosure of one identity attribute lead to the disclosure of another?

Different individuals and organizations hold different sets of PII and therefore have different perspectives on which PII attributes are more vulnerable, more valuable, and in greater need of protection. An individual's PII falls into four categories [1]: What You Know (e.g., name, address, phone number, mother's maiden name); What You Have (e.g., driver's license, Social Security card, employee ID, passport); What You Are (e.g., fingerprint, voice, facial image); and What You Do (e.g., patterns of life such as websites visited, GPS locations visited, phone logs).

Protecting PII can be costly and time-consuming. Research has uncovered various strategies for reducing the risk of unintended data disclosure [2], including statistical disclosure limitation (SDL) techniques commonly used by national statistical agencies before releasing public-use data sets.
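To make the graph model concrete, here is a minimal sketch of an Identity Ecosystem-style graph in Python using networkx. The attribute names and edge probabilities are illustrative assumptions, not values from the paper's 5,000-case dataset, and the propagation routine is an independent-cascade-style simplification rather than the authors' exact graph-theoretic method.

```python
# Minimal sketch of the Identity Ecosystem graph described above.
# Nodes are PII attributes; a directed edge (u, v) with weight p means
# "if u is exposed, v is also exposed with empirical probability p".
# Attribute names and probabilities are illustrative placeholders.
import networkx as nx

G = nx.DiGraph()
G.add_weighted_edges_from([
    ("email", "name", 0.6),
    ("name", "address", 0.4),
    ("address", "phone_number", 0.3),
    ("email", "phone_number", 0.2),
])

def disclosure_risk(graph, compromised, max_hops=3):
    """Estimate the probability that each attribute is eventually exposed,
    treating edges as firing independently (an independent-cascade-style
    simplification, not the paper's exact analysis)."""
    risk = {n: 0.0 for n in graph.nodes}
    for n in compromised:
        risk[n] = 1.0
    frontier = set(compromised)
    for _ in range(max_hops):
        next_frontier = set()
        for u in frontier:
            for v in graph.successors(u):
                p = graph[u][v]["weight"]
                # P(v exposed) = 1 - P(v not already exposed) * P(u->v does not fire)
                updated = 1.0 - (1.0 - risk[v]) * (1.0 - risk[u] * p)
                if updated > risk[v] + 1e-9:
                    risk[v] = updated
                    next_frontier.add(v)
        frontier = next_frontier
    return risk

# Which attributes become exposed once an email address is compromised?
print(disclosure_risk(G, compromised=["email"]))
```

Encoding disclosure relationships as a weighted directed graph is what lets the core question ("can exposing one attribute lead to exposing another?") be posed as reachability and probability propagation rather than case-by-case analysis.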
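The GNN component is described here only at a high level, so the following is a hedged sketch of one plausible realization: link prediction over the same attribute graph with a two-layer GCN encoder (via PyTorch Geometric) and a dot-product decoder. The embedding dimension, architecture, and decoder are assumptions for illustration, not the authors' model.

```python
# A hedged sketch: score "does exposing attribute u lead to exposure of v?"
# as link prediction on the Identity Ecosystem graph. One plausible
# realization, not the paper's exact model.
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class DisclosureGNN(torch.nn.Module):
    def __init__(self, num_attributes, hidden_dim=16):
        super().__init__()
        # One learnable embedding per PII attribute (no raw features assumed).
        self.embed = torch.nn.Embedding(num_attributes, hidden_dim)
        self.conv1 = GCNConv(hidden_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, hidden_dim)

    def encode(self, edge_index):
        x = F.relu(self.conv1(self.embed.weight, edge_index))
        return self.conv2(x, edge_index)

    def decode(self, z, pairs):
        # Probability that exposing pairs[0] leads to exposure of pairs[1].
        return torch.sigmoid((z[pairs[0]] * z[pairs[1]]).sum(dim=-1))

# Four attributes (email=0, name=1, address=2, phone=3) with the known
# disclosure edges from the networkx sketch above.
edge_index = torch.tensor([[0, 1, 2, 0],
                           [1, 2, 3, 3]], dtype=torch.long)
model = DisclosureGNN(num_attributes=4)
z = model.encode(edge_index)
# Score an unseen candidate pair: does exposing "name" lead to "phone"?
print(model.decode(z, torch.tensor([[1], [3]])))
```

In practice such a model would be trained with binary cross-entropy on observed disclosure edges against negative-sampled attribute pairs; the untrained score printed here only demonstrates the data flow.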