Erase or Hide? Suppressing Spurious Unlearning Neurons for Robust Unlearning
Nakyeong Yang, Dong-Kyum Kim, Jea Kwon, Minsung Kim, Kyomin Jung, Meeyoung Cha
arXiv.org Artificial Intelligence
Large language models trained on web-scale data can memorize private or sensitive knowledge, raising significant privacy risks. Although some unlearning methods mitigate these risks, they remain vulnerable to "relearning" during subsequent training, allowing a substantial portion of forgotten knowledge to resurface. In this paper, we show that widely used unlearning methods cause shallow alignment: instead of faithfully erasing target knowledge, they generate spurious unlearning neurons that amplify negative influence to hide it. Experimental results confirm that our method reliably erases target knowledge and outperforms strong baselines across two practical retraining scenarios: (1) adversarial injection of private data, and (2) a benign attack using an instruction-following benchmark. Our findings highlight the necessity of robust and faithful unlearning methods for the safe deployment of language models.

Large language models (LLMs) are built on vast corpora of web-scale data, equipping them with broad capabilities across diverse tasks. Yet, this scale introduces privacy risks, as training datasets may inadvertently contain sensitive or personally identifiable information. In response, prior works have explored strategies to remove private or sensitive knowledge from LLMs. Such approaches include gradient-based interventions (Jang et al., 2022; Maini et al., 2024), preference-driven optimization frameworks (Jin et al., 2024; Yang et al., 2025), and representation learning techniques (Li et al., 2024), each of which aims to mitigate privacy risks embedded in model parameters. Despite these efforts, prior studies reveal that existing unlearning techniques often fail to robustly eliminate target knowledge.
Models subjected to such interventions remain susceptible to prompt-based elicitation (Jin et al., 2024; Yang et al., 2025) and can inadvertently recover forgotten information through representational shifts introduced by subsequent training (Deeb & Roger, 2024; Hu et al., 2024).
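As a purely illustrative sketch (not the paper's method), the gradient-ascent style of unlearning and its relearning vulnerability can be shown on a toy one-parameter logistic model: ascending the loss of a memorized "forget" example drives its probability toward zero, yet a few ordinary training steps on the same example restore it. All names and hyperparameters here are hypothetical choices for the demonstration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def nll_grad(w, x, y):
    # Gradient of the negative log-likelihood of (x, y) w.r.t. the weight w
    # for a logistic model p(y=1 | x) = sigmoid(w * x).
    return (sigmoid(w * x) - y) * x

w = 4.0            # toy "model" that has memorized the example: p ~ 0.98
x, y = 1.0, 1.0    # the private example to forget
lr = 5.0           # illustrative step size

# 1) Gradient-ascent unlearning: ascend the NLL of the forget example.
for _ in range(20):
    w += lr * nll_grad(w, x, y)
p_after_unlearn = sigmoid(w * x)   # driven close to 0

# 2) "Relearning" attack: ordinary gradient-descent steps on the same data.
for _ in range(20):
    w -= lr * nll_grad(w, x, y)
p_after_relearn = sigmoid(w * x)   # memorization resurfaces

print(f"after unlearning: {p_after_unlearn:.4f}")
print(f"after relearning: {p_after_relearn:.4f}")
```

The point of the sketch is that ascent only pushes the parameter to a region of low probability rather than removing the underlying direction of the solution, so descent on the same example walks straight back; this mirrors, in miniature, why suppression-style unlearning is fragile under subsequent training.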
Sep-29-2025