Curvature-Aware Safety Restoration In LLMs Fine-Tuning
Bach, Thong, Nguyen-Tang, Thanh, Nguyen, Dung, Le, Thao Minh, Tran, Truyen
arXiv.org Artificial Intelligence
Large Language Models (LLMs) encode safety-aligned behaviors during pretraining, but these safeguards deteriorate during task-specific fine-tuning, a phenomenon we identify as safety alignment drift. Studies demonstrate that even minimal fine-tuning can compromise safety mechanisms, with models like GPT-3.5 Turbo becoming consistently unsafe after adaptation on just 10 adversarial examples [Qi et al., 2023a]. Attempts to address this issue by modifying model behavior generally fall into two main categories, both of which suffer from inherent limitations. Behavioral unlearning methods attempt to remove undesirable knowledge or responses [Cao and Yang, 2015, Bourtoule et al., 2021a], but often require costly retraining or risk catastrophic forgetting. Model editing approaches aim to update factual associations or local behaviors through direct parameter intervention [Meng et al., 2022, Mitchell et al., 2022], yet struggle to generalize beyond narrow scopes or isolated prompts. To address these limitations, we propose a new direction that treats safety behavior as an intrinsic property of the model's geometry and seeks to restore alignment through curvature-aware navigation of the loss landscape. Our key insight, supported by extensive empirical analysis (Section 2), is that models preserve notable structural properties in their loss landscapes with respect to harmful content after fine-tuning. Specifically, we observe high correlations in models' responses to harmful inputs before and after fine-tuning, despite substantial divergence in other functional behaviors. This suggests that safety mechanisms remain largely preserved in the parameter space and are merely shifted to less dominant regions during task-specific optimization.
Nov-25-2025
- Country:
- North America > United States (0.46)
- Genre:
- Research Report > New Finding (0.46)
- Industry:
- Information Technology (0.46)