What Makes and Breaks Safety Fine-tuning? A Mechanistic Study

Feb-17-2026, 07:39:15 GMT–Neural Information Processing Systems

Safety fine-tuning helps align Large Language Models (LLMs) with human preferences for their safe deployment. To better understand the underlying factors that make models safe via safety fine-tuning, we design a synthetic data generation framework that captures salient aspects of an unsafe input by modeling the interaction between the task the model is asked to perform (e.g., "design") and the specific concepts the task is to be performed upon (e.g., a "cycle" vs. a "bomb").

large language model, machine learning, natural language

Country:
- North America > United States
  - Michigan (0.04)
  - Massachusetts > Norfolk County
    - Wellesley (0.04)
- Europe
  - United Kingdom > England
    - Oxfordshire > Oxford (0.04)
  - Latvia > Lubāna Municipality
    - Lubāna (0.04)
  - Denmark > Capital Region
    - Copenhagen (0.04)

Genre:
- Research Report
  - Experimental Study (1.00)
  - New Finding (0.67)

Industry:
- Information Technology (0.69)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)
