Representation Noising: A Defence Mechanism Against Harmful Finetuning

Mar-18-2026, 11:49:42 GMT–Neural Information Processing Systems

Releasing open-source large language models (LLMs) presents a dual-use risk since bad actors can easily fine-tune these models for harmful purposes. Even without the open release of weights, weight stealing and fine-tuning APIs make closed models vulnerable to harmful fine-tuning attacks (HFAs). While safety measures like preventing jailbreaks and improving safety guardrails are important, such measures can easily be reversed through fine-tuning. In this work, we propose Representation Noising (RepNoise), a defence mechanism that operates even when attackers have access to the weights.

artificial intelligence, large language model, natural language

Technology:
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.65)