Curvature-Aware Safety Restoration In LLMs Fine-Tuning

Bach, Thong; Nguyen-Tang, Thanh; Nguyen, Dung; Le, Thao Minh; Tran, Truyen

arXiv.org Artificial Intelligence 

Large Language Models (LLMs) encode safety-aligned behaviors during pretraining, but these safeguards deteriorate during task-specific fine-tuning, a phenomenon we identify as safety alignment drift. Studies demonstrate that even minimal fine-tuning can compromise safety mechanisms, with models like GPT-3.5 Turbo becoming consistently unsafe after adaptation on just 10 adversarial examples [Qi et al., 2023a]. Attempts to address this issue by modifying model behavior generally fall into two main categories, both of which suffer from inherent limitations. Behavioral unlearning methods attempt to remove undesirable knowledge or responses [Cao and Yang, 2015, Bourtoule et al., 2021a], but often require costly retraining or risk catastrophic forgetting. Model editing approaches aim to update factual associations or local behaviors through direct parameter intervention [Meng et al., 2022, Mitchell et al., 2022], yet struggle to generalize beyond narrow scopes or isolated prompts. To solve these issues, we propose a new direction that treats safety behavior as an intrinsic property of the model's geometry and seeks to restore alignment through curvature-aware navigation of the loss landscape. Our key insight, supported by extensive empirical analysis (Section 2), is that models preserve notable structural properties in their loss landscapes with respect to harmful content after fine-tuning. Specifically, we observe high correlations in models' responses to harmful inputs before and after fine-tuning, despite substantial divergence in other functional behaviors. This suggests that safety mechanisms remain largely preserved in the parameter space, merely shifted to less dominant regions during task-specific optimization.
