Towards LLM Unlearning Resilient to Relearning Attacks: A Sharpness-Aware Minimization Perspective and Beyond

Chongyu Fan, Jinghan Jia, Yihua Zhang, Anil Ramakrishna, Mingyi Hong, Sijia Liu

arXiv.org Artificial Intelligence 

With the rapid advancement of large language models (LLMs), concerns about their privacy, safety, and trustworthiness have become increasingly prominent (Liu et al., 2024d; Barez et al., 2025). However, retraining these models from scratch to eliminate the undesired data-model influence is often infeasible due to the significant computational and time costs involved. To address this challenge, LLM unlearning (Yao et al., 2024; Eldan & Russinovich, 2023; Maini et al., 2024; Liu et al., 2024b) has emerged as a post-pretraining strategy that aims to mitigate the impact of undesirable data (e.g., sensitive, biased, unsafe, or illegal information) and suppress the associated model capabilities, thereby preventing LLMs from generating harmful content while preserving the model's utility post-unlearning. Despite the increasing importance of LLM unlearning, several recent studies (Łucki et al., 2024; Zhang et al., 2024e; Lynch et al., 2024; Hu et al., 2024; Deeb & Roger, 2024) have identified a critical issue: LLM unlearning often lacks robustness. Specifically, 'already-unlearned' knowledge can be quickly recovered post-unlearning through so-called relearning attacks (Lynch et al., 2024; Hu et al., 2024). These attacks can effectively reverse the unlearning process through lightweight fine-tuning of the unlearned model on only a small amount of data from the forget dataset.
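The relearning attack described above amounts to brief supervised fine-tuning of the unlearned model on a handful of forget-set samples. The PyTorch/Hugging Face sketch below illustrates this general idea only; the model path, forget-set snippets, and hyperparameters are illustrative placeholders and not the paper's experimental setup.

```python
# Minimal sketch of a relearning attack: lightweight causal-LM fine-tuning of an
# already-unlearned model on a small subset of the forget data.
import torch
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "path/to/unlearned-model"  # placeholder: an already-unlearned LLM checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.train()

# A handful of forget-set examples can suffice to recover unlearned behavior.
forget_subset = [
    "example text drawn from the forget dataset ...",
    "another forget-set sample ...",
]

optimizer = AdamW(model.parameters(), lr=2e-5)

for epoch in range(3):  # "lightweight": a few epochs over a few samples
    for text in forget_subset:
        inputs = tokenizer(text, return_tensors="pt")
        # Standard next-token prediction loss on the forget data (labels = input ids).
        outputs = model(**inputs, labels=inputs["input_ids"])
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```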
