Erase or Hide? Suppressing Spurious Unlearning Neurons for Robust Unlearning
Nakyeong Yang, Dong-Kyum Kim, Jea Kwon, Minsung Kim, Kyomin Jung, Meeyoung Cha
arXiv.org Artificial Intelligence
Large language models trained on web-scale data can memorize private or sensitive knowledge, raising significant privacy risks. Although some unlearning methods mitigate these risks, they remain vulnerable to "relearning" during subsequent training, allowing a substantial portion of forgotten knowledge to resurface. In this paper, we show that widely used unlearning methods cause shallow alignment: instead of faithfully erasing target knowledge, they generate spurious unlearning neurons that amplify negative influence to hide it. Experimental results confirm that our method reliably erases target knowledge and outperforms strong baselines across two practical retraining scenarios: (1) adversarial injection of private data, and (2) a benign attack using an instruction-following benchmark. Our findings highlight the necessity of robust and faithful unlearning methods for the safe deployment of language models.

Large language models (LLMs) are built on vast corpora of web-scale data, equipping them with broad capabilities across diverse tasks. Yet, this scale introduces privacy risks, as training datasets may inadvertently contain sensitive or personally identifiable information. In response, prior works have explored strategies to remove private or sensitive knowledge from LLMs. Such approaches include gradient-based interventions (Jang et al., 2022; Maini et al., 2024), preference-driven optimization frameworks (Jin et al., 2024; Yang et al., 2025), and representation learning techniques (Li et al., 2024), each of which aims to mitigate privacy risks embedded in model parameters. Despite these efforts, prior studies reveal that existing unlearning techniques often fail to robustly eliminate target knowledge.
Models subjected to such interventions remain susceptible to prompt-based elicitation (Jin et al., 2024; Yang et al., 2025) and can inadvertently recover forgotten information through representational shifts introduced by subsequent training (Deeb & Roger, 2024; Hu et al., 2024).
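As a purely illustrative sketch (not the paper's method), the gradient-ascent style of unlearning and its relearning vulnerability can be shown on a toy one-parameter logistic model: ascending the loss of a memorized "forget" example drives its probability toward zero, yet a few ordinary training steps on the same example restore it. All names and hyperparameters here are hypothetical choices for the demonstration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def nll_grad(w, x, y):
    # Gradient of the negative log-likelihood of (x, y) w.r.t. the weight w
    # for a logistic model p(y=1 | x) = sigmoid(w * x).
    return (sigmoid(w * x) - y) * x

w = 4.0            # toy "model" that has memorized the example: p ~ 0.98
x, y = 1.0, 1.0    # the private example to forget
lr = 5.0           # illustrative step size

# 1) Gradient-ascent unlearning: ascend the NLL of the forget example.
for _ in range(20):
    w += lr * nll_grad(w, x, y)
p_after_unlearn = sigmoid(w * x)   # driven close to 0

# 2) "Relearning" attack: ordinary gradient-descent steps on the same data.
for _ in range(20):
    w -= lr * nll_grad(w, x, y)
p_after_relearn = sigmoid(w * x)   # memorization resurfaces

print(f"after unlearning: {p_after_unlearn:.4f}")
print(f"after relearning: {p_after_relearn:.4f}")
```

The point of the sketch is that ascent only pushes the parameter to a region of low probability rather than removing the underlying direction of the solution, so descent on the same example walks straight back; this mirrors, in miniature, why suppression-style unlearning is fragile under subsequent training.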
Sep-29-2025