A Closer Look at Machine Unlearning for Large Language Models
Xiaojian Yuan, Tianyu Pang, Chao Du, Kejiang Chen, Weiming Zhang, Min Lin
–arXiv.org Artificial Intelligence
Due to the high cost of retraining from scratch, researchers attempt to employ machine unlearning to remove specific content from LLMs while preserving the overall performance. In this paper, we discuss several issues in machine unlearning for LLMs and provide our insights into possible approaches. To address the inadequate evaluation of model outputs after unlearning, we introduce three additional metrics that measure token diversity, sentence semantics, and factual correctness. We then categorize unlearning methods as untargeted or targeted and discuss their respective issues. Specifically, the behavior that untargeted unlearning attempts to approximate is unpredictable and may involve hallucinations, while existing regularization is insufficient for targeted unlearning. To alleviate these issues, we propose maximizing entropy (ME) as the objective for untargeted unlearning and incorporating an answer preservation (AP) loss as regularization for targeted unlearning. Experimental results across three scenarios, i.e., fictitious unlearning, continual unlearning, and real-world unlearning, demonstrate the effectiveness of our approaches.

In recent years, large language models (LLMs) have undergone rapid development, demonstrating impressive capabilities across a wide range of applications, from natural language processing to complex problem-solving. However, because LLMs are trained on massive corpora, they may memorize private or copyrighted content, raising concerns about privacy breaches and copyright infringement. These concerns are particularly relevant within legal and regulatory frameworks, such as the Right to be Forgotten (Dang, 2021), which aims to empower individuals to have unauthorized data erased from digital records. Addressing these issues is crucial for ensuring the responsible deployment of LLMs in real-world applications.

Due to the high cost of retraining LLMs, researchers have explored machine unlearning techniques, namely LLM unlearning (Cao & Yang, 2015; Bourtoule et al., 2021; Yao et al., 2023). The typical paradigm involves fine-tuning the target LLM on a specified set, known as the forget set, to obtain an unlearned model. As described in Maini et al. (2024) and Jin et al. (2024), the unlearned model should meet two primary goals: 1) it should not reveal any information contained in the forget set, and 2) it should maintain performance on the neighbor set, which has a distribution similar to the forget set but is not the target of unlearning, as well as on other tasks requiring general knowledge. While the first goal is generally easier to achieve, the main challenge lies in meeting the second (Liu et al., 2024b; Maini et al., 2024; Zhang et al., 2024a; Ji et al., 2024; Shi et al., 2024a; Wang et al., 2024c).

In this paper, we take a closer look at machine unlearning for LLMs. We note that most prior studies (Maini et al., 2024; Ji et al., 2024; Jia et al., 2024; Jin et al., 2024; Shi et al., 2024a) rely primarily on ROUGE (Lin, 2004) as the sole metric for evaluating the output of unlearned models.
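ROUGE alone can miss degenerate behavior, such as an unlearned model that loops on a handful of tokens, which motivates a token-diversity metric of the kind the abstract mentions. The function below is a minimal illustrative sketch (a simple unique-token ratio), not necessarily the paper's exact definition:

```python
def token_diversity(token_ids: list[int]) -> float:
    """Unique-token ratio of a generated sequence (illustrative sketch;
    the paper's token-diversity metric may be defined differently).
    Degenerate outputs that repeat a few tokens score near 0."""
    if not token_ids:
        return 0.0
    return len(set(token_ids)) / len(token_ids)

# A fluent answer uses many distinct tokens; a collapsed one repeats a few.
print(token_diversity([5, 12, 98, 7, 33, 41]))  # 1.0
print(token_diversity([5, 5, 5, 5, 5, 5]))      # ~0.17
```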
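Returning to the ME objective proposed above: the idea is to push the unlearned model's next-token distribution on forget-set answers toward uniform, giving untargeted unlearning a predictable target instead of an unknown, possibly hallucinatory behavior. Below is a minimal PyTorch-style sketch, assuming HuggingFace-style causal-LM logits and labels; the function name and masking convention are ours, not the paper's:

```python
import torch
import torch.nn.functional as F

def me_loss(logits: torch.Tensor, labels: torch.Tensor,
            ignore_index: int = -100) -> torch.Tensor:
    """Maximum-entropy (ME) unlearning loss on forget-set tokens.

    Minimizes the cross-entropy between a uniform target distribution and
    the model's next-token distribution, which (up to a constant) maximizes
    predictive entropy. Sketch only; assumes logits of shape [B, T, V] and
    labels of shape [B, T] with prompt tokens set to ignore_index.
    """
    # Shift so position t predicts token t+1, as in standard LM training.
    logits = logits[:, :-1, :]
    labels = labels[:, 1:]
    log_probs = F.log_softmax(logits, dim=-1)
    # Cross-entropy against uniform: -(1/V) * sum_v log p_v = -mean_v log p_v.
    uniform_ce = -log_probs.mean(dim=-1)   # [B, T-1]
    mask = labels != ignore_index          # score only forget-answer tokens
    return uniform_ce[mask].mean()
```

Driving forget-set predictions toward uniform avoids approximating an unpredictable "retrained" behavior, which is the failure mode the paper attributes to existing untargeted unlearning objectives.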
Nov-20-2024