Vaccine: Perturbation-aware Alignment for Large Language Models against Harmful Fine-tuning Attack
Neural Information Processing Systems
The new paradigm of fine-tuning-as-a-service introduces a new attack surface for Large Language Models (LLMs): a small amount of harmful data uploaded by users can easily trick the fine-tuning process into producing an alignment-broken model. We conduct an empirical analysis and uncover a "harmful embedding drift" phenomenon, a probable cause of the alignment-broken effect. Inspired by our findings, we propose Vaccine, a perturbation-aware alignment technique that mitigates the security risk of user fine-tuning. The core idea of Vaccine is to produce invariant hidden embeddings by progressively adding crafted perturbations to them during the alignment phase. This enables the embeddings to withstand harmful perturbation from unsanitized user data in the fine-tuning phase.
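To make the mechanism concrete, below is a minimal PyTorch sketch of the two-pass, perturbation-aware update the abstract describes: a first forward/backward pass finds the embedding perturbation that most increases the alignment loss, and a second pass updates the weights under that perturbation. The toy model, the hook placement, the L2-normalized perturbation, and the radius `rho` are illustrative assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-in for an LLM: embedding -> two MLP "blocks" -> LM head.
# ToyLM, vaccine_step, and rho are illustrative names/values, not the
# paper's reference implementation.
class ToyLM(nn.Module):
    def __init__(self, vocab=100, dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU()) for _ in range(2)
        )
        self.head = nn.Linear(dim, vocab)

    def forward(self, ids):
        h = self.embed(ids)
        for blk in self.blocks:
            h = blk(h)
        return self.head(h)


def vaccine_step(model, ids, labels, optimizer, rho=0.1):
    """One perturbation-aware alignment step (two forward/backward passes)."""
    captured = {}

    def capture(module, inputs, output):
        output.retain_grad()          # keep grad w.r.t. the hidden embedding
        captured["h"] = output

    # Pass 1: gradient of the alignment loss w.r.t. the first block's
    # hidden state gives the worst-case (loss-increasing) direction.
    handle = model.blocks[0].register_forward_hook(capture)
    logits = model(ids)
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))
    loss.backward()
    handle.remove()

    with torch.no_grad():
        g = captured["h"].grad
        delta = rho * g / (g.norm() + 1e-12)  # crafted perturbation, radius rho
    optimizer.zero_grad()

    # Pass 2: inject the perturbation into the same hidden state and
    # minimize the perturbed loss, so the learned embedding stays invariant.
    inject = model.blocks[0].register_forward_hook(lambda m, i, o: o + delta)
    logits = model(ids)
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))
    loss.backward()
    inject.remove()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()


if __name__ == "__main__":
    model = ToyLM()
    opt = torch.optim.SGD(model.parameters(), lr=1e-2)
    ids = torch.randint(0, 100, (4, 16))      # dummy alignment batch
    print(vaccine_step(model, ids, ids.clone(), opt))
```

The two-pass structure mirrors sharpness-aware training but applies the perturbation to hidden embeddings rather than weights, which is what lets the aligned embeddings resist the drift induced by later harmful fine-tuning.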