Representation Noising: A Defence Mechanism Against Harmful Finetuning

Jan Wehner, Kai Williams

May-28-2025, 14:17:35 GMT–Neural Information Processing Systems

Releasing open-source large language models (LLMs) presents a dual-use risk since bad actors can easily fine-tune these models for harmful purposes. Even without the open release of weights, weight stealing and fine-tuning APIs make closed models vulnerable to harmful fine-tuning attacks (HFAs). While safety measures like preventing jailbreaks and improving safety guardrails are important, such measures can easily be reversed through fine-tuning.

large language model, machine learning, natural language

Country:
- Europe (0.45)
- North America > Canada
  - Ontario > Toronto (0.14)

Genre:
- Research Report
  - Experimental Study (1.00)
  - New Finding (1.00)

Industry:
- Information Technology > Security & Privacy (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)
  - Natural Language
    - Chatbot (0.93)
    - Large Language Model (1.00)
  - Representation & Reasoning (1.00)