Navigating the Safety Landscape: Measuring Risks in Finetuning Large Language Models

May-27-2025, 12:32:29 GMT–Neural Information Processing Systems

Safety alignment is crucial to ensuring that large language models (LLMs) behave in line with human preferences and avoid harmful actions during inference. However, recent studies show that alignment can be easily compromised by finetuning on only a few adversarially designed training examples. We aim to measure the risks of finetuning LLMs by navigating the LLM safety landscape. We discover a new phenomenon, observed universally in the model parameter space of popular open-source LLMs, which we term the "safety basin": random perturbations to model weights maintain the safety level of the original aligned model within its local neighborhood. Outside this local region, however, safety is fully compromised, exhibiting a sharp, step-like drop.
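
The idea of probing safety under random weight perturbations can be illustrated with a small sketch. The abstract does not spell out the procedure, so everything below is an assumption made for illustration: the weights are a flat list of floats, the random direction is rescaled to the norm of the weights, and `make_toy_safety` is a hypothetical stand-in for a real safety evaluation (for instance, the fraction of harmful prompts a model refuses). The toy metric is step-like by construction, so the sketch shows the shape of the measurement loop, not the paper's finding.

```python
import math
import random


def norm(v):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(x * x for x in v))


def random_direction(weights, rng):
    """Draw a Gaussian direction and rescale it to the norm of the weights.

    The rescaling is an assumption of this sketch, chosen so that alpha is
    measured in units of the weight norm.
    """
    d = [rng.gauss(0.0, 1.0) for _ in weights]
    scale = norm(weights) / norm(d)
    return [x * scale for x in d]


def perturb(weights, direction, alpha):
    """Return weights + alpha * direction."""
    return [w + alpha * d for w, d in zip(weights, direction)]


def landscape_1d(weights, direction, safety_fn, alphas):
    """Evaluate safety_fn on the perturbed weights for each step size alpha."""
    return [safety_fn(perturb(weights, direction, a)) for a in alphas]


def make_toy_safety(aligned, radius):
    """Hypothetical safety metric: 1.0 within `radius` of the aligned weights,
    0.0 outside. A real metric would run the perturbed model on a benchmark."""
    def safety(w):
        dist = norm([a - b for a, b in zip(w, aligned)])
        return 1.0 if dist <= radius else 0.0
    return safety


rng = random.Random(0)
aligned = [rng.gauss(0.0, 1.0) for _ in range(16)]  # toy "aligned" weights
direction = random_direction(aligned, rng)
alphas = [i / 4 for i in range(-4, 5)]  # step sizes from -1.0 to 1.0
safety = make_toy_safety(aligned, radius=0.6 * norm(aligned))
scores = landscape_1d(aligned, direction, safety, alphas)

for a, s in zip(alphas, scores):
    print(f"alpha={a:+.2f}  safety={s:.1f}")
```

Replacing the toy metric with an actual evaluation of a perturbed model would turn the same loop into a one-dimensional slice of the safety landscape around the aligned weights.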

language model, measuring risk, safety landscape

Genre:
- Research Report > New Finding (0.41)

Technology:
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)