Proactive Privacy Amnesia for Large Language Models: Safeguarding PII with Negligible Impact on Model Utility

Martin Kuo, Jingyang Zhang, Jianyi Zhang, Minxue Tang, Louis DiValentin, Aolin Ding, Jingwei Sun, William Chen, Amin Hass, Tianlong Chen, Yiran Chen, Hai Li

arXiv.org Artificial Intelligence 

With the rise of large language models (LLMs), increasing research has recognized their risk of leaking personally identifiable information (PII) under malicious attacks. Although efforts have been made to protect PII in LLMs, existing methods struggle to balance privacy protection with maintaining model utility. In this paper, inspired by studies of amnesia in cognitive science, we propose a novel approach, Proactive Privacy Amnesia (PPA), to safeguard PII in LLMs while preserving their utility. This mechanism works by actively identifying and forgetting the key memories most closely associated with PII in sequences, then implanting suitable substitute memories to maintain the LLM's functionality. We evaluate our method across multiple models, protecting common PII such as phone numbers and physical addresses against prevalent PII-targeted attacks, and demonstrate its superiority over existing defensive techniques. The results show that our PPA method eliminates the risk of phone number exposure entirely (100%) and significantly reduces the risk of physical address exposure by 9.8%-87.6%, all while maintaining comparable model utility.

Large Language Models (LLMs) (Touvron et al., 2023; Achiam et al., 2023; Team et al., 2023; Dubey et al., 2024) have achieved remarkable success in recent years, seeing wide adoption either as general-purpose models or, after fine-tuning, as specialized and personal assistants. Despite this success, LLMs, with their huge parameter counts and great capacity, also exhibit a concerning "memorization" phenomenon (Carlini et al., 2019; 2021): they can precisely memorize portions of their training data. Such memorization is vulnerable to various attacks (e.g., membership inference attacks and data extraction attacks) and risks severe privacy breaches. One of the most serious concerns comes from attacks that aim to extract personally identifiable information (PII) memorized by the models, which compromise users' privacy and can consequently cause real-world harm. To defend against such PII or data extraction attacks, several machine unlearning techniques have been applied to LLMs. However, existing methods typically fall short in the trade-off between defense performance and model utility. For example, most unlearning approaches are based on gradient ascent (Jang et al., 2022; Wang et al., 2024) and often degrade model functionality to the point where the model can no longer handle its original tasks and thus ceases to be useful, as the sketches below illustrate.
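To make that trade-off concrete, the following is a minimal sketch of what a bare gradient-ascent unlearning loop looks like in practice. It is an illustration in the spirit of Jang et al. (2022), not their exact recipe; the model choice, learning rate, step count, and PII string are all placeholders. Negating the language-modeling loss pushes the model away from the memorized sequence, but the same full-sequence update also erodes unrelated capabilities.

```python
# Minimal sketch of gradient-ascent unlearning on a memorized sequence.
# Model name, hyperparameters, and the PII string are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

forget_text = "Alice's phone number is 555-0123"  # hypothetical PII sequence
batch = tokenizer(forget_text, return_tensors="pt")

model.train()
for _ in range(10):  # too many ascent steps visibly degrade general utility
    outputs = model(**batch, labels=batch["input_ids"])
    loss = -outputs.loss  # negate the LM loss: ascend instead of descend
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```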
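By contrast, the forget-then-implant idea behind PPA can be sketched as follows. This is a hedged illustration, not the authors' implementation: the paper's precise criterion for identifying key memories is not reproduced here, and we assume, purely for illustration, that the most strongly memorized token is the one with the lowest per-token loss. The helper names `find_key_token` and `forget_then_implant` are hypothetical.

```python
# Hedged sketch of a forget-then-implant pipeline in the spirit of PPA.
# Assumption (illustrative only): the "key memory" is the token with the
# lowest next-token cross-entropy, i.e., the most confidently memorized one.
import torch
import torch.nn.functional as F

def find_key_token(model, input_ids):
    """Return the index of the most strongly memorized token (assumed to be
    the one with the lowest next-token loss)."""
    with torch.no_grad():
        logits = model(input_ids).logits
    # losses[i] is the loss of predicting token i+1 from the prefix up to i
    losses = F.cross_entropy(logits[0, :-1], input_ids[0, 1:], reduction="none")
    return int(losses.argmin()) + 1  # +1: losses[i] scores token i+1

def forget_then_implant(model, tokenizer, pii_text, substitute_text, steps=5):
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
    pii_ids = tokenizer(pii_text, return_tensors="pt").input_ids
    sub = tokenizer(substitute_text, return_tensors="pt")
    key = find_key_token(model, pii_ids)
    for _ in range(steps):
        # 1) selective forgetting: ascend on the key token's loss only
        logits = model(pii_ids).logits
        key_loss = F.cross_entropy(
            logits[0, key - 1 : key], pii_ids[0, key : key + 1]
        )
        # 2) memory implanting: descend on a substitute sequence
        implant_loss = model(**sub, labels=sub["input_ids"]).loss
        (implant_loss - key_loss).backward()
        optimizer.step()
        optimizer.zero_grad()
```

Restricting the ascent term to a single key token, while simultaneously descending on a substitute sequence, is what lets this style of defense sever the PII association without the broad utility collapse that full-sequence gradient ascent tends to cause.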