Representation Noising: A Defence Mechanism Against Harmful Finetuning
Jan Wehner, Kai Williams
Releasing open-source large language models (LLMs) presents a dual-use risk since bad actors can easily fine-tune these models for harmful purposes. Even without the open release of weights, weight stealing and fine-tuning APIs make closed models vulnerable to harmful fine-tuning attacks (HFAs). While safety measures like preventing jailbreaks and improving safety guardrails are important, such measures can easily be reversed through fine-tuning.
Neural Information Processing Systems
May 28, 2025, 14:17:35 GMT
- Country:
  - Europe (0.45)
  - North America > Canada
- Genre:
  - Research Report
  - Experimental Study (1.00)
  - New Finding (1.00)
- Industry:
  - Information Technology > Security & Privacy (1.00)
- Technology: