SafeMERGE: Preserving Safety Alignment in Fine-Tuned Large Language Models via Selective Layer-Wise Model Merging

Djuhera, Aladin, Kadhe, Swanand Ravindra, Ahmed, Farhan, Zawad, Syed, Boche, Holger

arXiv.org Artificial Intelligence 

Fine-tuning large language models (LLMs) on downstream tasks can inadvertently erode their safety alignment, even for benign fine-tuning datasets. To address this challenge, we propose SafeMERGE, a post-fine-tuning framework that preserves safety while maintaining task utility. It achieves this by selectively merging fine-tuned and safety-aligned model layers only when they deviate from safe behavior, measured by a cosine similarity criterion (see the illustrative sketch below). We evaluate SafeMERGE against other fine-tuning- and post-fine-tuning-stage approaches for Llama-2-7B-Chat and Qwen-2-7B-Instruct models on the GSM8K and PubMedQA tasks while exploring different merging strategies. We find that SafeMERGE consistently reduces harmful outputs compared to other baselines without significantly sacrificing performance, sometimes even enhancing it. The results suggest that our selective, subspace-guided, per-layer merging method provides an effective safeguard against the inadvertent loss of safety in fine-tuned LLMs while outperforming simpler post-fine-tuning-stage defenses.

Large language models (LLMs) have demonstrated remarkable capabilities in text generation and understanding while becoming increasingly accessible to AI practitioners. Safety tuning is critical to ensure that advanced LLMs align with human values and security policies, making them safe for deployment (Ouyang et al., 2022; Bai et al., 2022; Chiang et al., 2023; Zhang et al., 2024). However, the safety alignment of current LLMs has been shown to be vulnerable (Wei et al., 2023; Huang et al., 2024e; Yang et al., 2023; Zeng et al., 2024; Zhan et al., 2024; Qi et al., 2023; 2024a).
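To make the selective, cosine-similarity-guided, layer-wise merge from the abstract concrete, below is a minimal PyTorch sketch. The per-layer weight dictionaries, the threshold tau, the merge coefficient alpha, and the use of flattened weight deltas as the similarity measure are illustrative assumptions for this sketch; the paper's actual criterion is subspace-guided and may differ in its details.

```python
import torch

def selective_layer_merge(finetuned, aligned, base, tau=0.3, alpha=0.5):
    """Illustrative per-layer selective merge (a sketch, not the authors' exact procedure).

    finetuned, aligned, base: dicts mapping layer names to weight tensors of the
    fine-tuned, safety-aligned, and original base model, respectively.
    tau (cosine-similarity threshold) and alpha (merge coefficient) are
    hypothetical hyperparameters chosen only for illustration.
    """
    merged = {}
    for name, w_ft in finetuned.items():
        w_al, w_base = aligned[name], base[name]
        # Compare the fine-tuning update against the safety-alignment update
        # for this layer via cosine similarity of the flattened weight deltas.
        delta_ft = (w_ft - w_base).flatten()
        delta_al = (w_al - w_base).flatten()
        cos = torch.nn.functional.cosine_similarity(delta_ft, delta_al, dim=0)
        if cos < tau:
            # Layer deviates from safe behavior: pull it back toward the
            # safety-aligned weights (simple linear interpolation here).
            merged[name] = alpha * w_ft + (1.0 - alpha) * w_al
        else:
            # Layer still tracks safe behavior: keep the fine-tuned weights.
            merged[name] = w_ft
    return merged
```

In the paper, the deviation criterion is subspace-guided rather than a raw weight-delta comparison, and the interpolation step above could be replaced by any of the merging strategies the authors explore; the sketch only illustrates the selective, per-layer structure of the approach.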