Towards LLM Unlearning Resilient to Relearning Attacks: A Sharpness-Aware Minimization Perspective and Beyond

Chongyu Fan, Jinghan Jia, Yihua Zhang, Anil Ramakrishna, Mingyi Hong, Sijia Liu

arXiv.org Artificial Intelligence 

With the rapid advancement of large language models (LLMs), concerns about their privacy, safety, and trustworthiness have become increasingly prominent (Liu et al., 2024d; Barez et al., 2025). However, retraining these models from scratch to eliminate the undesired data-model influence is often infeasible due to the significant computational and time costs involved. To address this challenge, LLM unlearning (Yao et al., 2024; Eldan & Russinovich, 2023; Maini et al., 2024; Liu et al., 2024b) has emerged as a post-pretraining strategy that aims to mitigate the impact of undesirable data (e.g., sensitive, biased, unsafe, or illegal information) and suppress the associated model capabilities, thereby preventing LLMs from generating harmful content while preserving the model's utility post-unlearning. Despite the increasing importance of LLM unlearning, several recent studies (Łucki et al., 2024; Zhang et al., 2024e; Lynch et al., 2024; Hu et al., 2024; Deeb & Roger, 2024) have identified a critical issue: LLM unlearning often lacks robustness. Specifically, 'already-unlearned' knowledge can be quickly recovered post-unlearning through so-called relearning attacks (Lynch et al., 2024; Hu et al., 2024). These attacks can effectively reverse the unlearning process through lightweight fine-tuning of the unlearned model on only a small amount of data from the forget dataset.
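The relearning attack described above amounts to brief supervised fine-tuning of the unlearned model on a handful of forget-set samples. The PyTorch/Hugging Face sketch below illustrates this general idea only; the model path, forget-set snippets, and hyperparameters are illustrative placeholders and not the paper's experimental setup.

```python
# Minimal sketch of a relearning attack: lightweight causal-LM fine-tuning of an
# already-unlearned model on a small subset of the forget data.
import torch
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "path/to/unlearned-model"  # placeholder: an already-unlearned LLM checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.train()

# A handful of forget-set examples can suffice to recover unlearned behavior.
forget_subset = [
    "example text drawn from the forget dataset ...",
    "another forget-set sample ...",
]

optimizer = AdamW(model.parameters(), lr=2e-5)

for epoch in range(3):  # "lightweight": a few epochs over a few samples
    for text in forget_subset:
        inputs = tokenizer(text, return_tensors="pt")
        # Standard next-token prediction loss on the forget data (labels = input ids).
        outputs = model(**inputs, labels=inputs["input_ids"])
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```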
