What Makes and Breaks Safety Fine-tuning? A Mechanistic Study
–Neural Information Processing Systems
Safety fine-tuning helps align Large Language Models (LLMs) with human preferences for their safe deployment. To better understand the underlying factors that make models safe via safety fine-tuning, we design a synthetic data generation framework that captures salient aspects of an unsafe input by modeling the interaction between the task the model is asked to perform (e.g., "design") and the specific concept the task is to be performed upon (e.g., a "cycle" vs. a "bomb").
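The task/concept split described above can be illustrated with a toy sketch. This is not the authors' framework: the vocabularies, the `make_examples` helper, the prompt template, and the rule that a pair is unsafe whenever its concept is in the unsafe set are all assumptions made here for illustration only.

```python
from itertools import product

# Hypothetical vocabularies, chosen only for illustration.
TASKS = ["design", "describe"]
SAFE_CONCEPTS = ["cycle"]
UNSAFE_CONCEPTS = ["bomb"]


def make_examples(tasks, safe_concepts, unsafe_concepts):
    """Pair every task with every concept.

    Assumption: a pair is labeled unsafe purely by its concept. The real
    framework may define unsafety differently.
    """
    unsafe = set(unsafe_concepts)
    concepts = list(safe_concepts) + list(unsafe_concepts)
    return [
        {"task": t, "concept": c, "prompt": f"{t} a {c}", "unsafe": c in unsafe}
        for t, c in product(tasks, concepts)
    ]


examples = make_examples(TASKS, SAFE_CONCEPTS, UNSAFE_CONCEPTS)
for ex in examples:
    print(ex["prompt"], "->", "unsafe" if ex["unsafe"] else "safe")
```

Keeping the task and the concept as separate fields means the same task can appear in both safe and unsafe examples, which is the kind of contrast the abstract points to with "design a cycle" versus "design a bomb".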
Feb-17-2026, 07:39:15 GMT
- Country:
  - Europe
    - Denmark > Capital Region
      - Copenhagen (0.04)
    - Latvia > Lubāna Municipality
      - Lubāna (0.04)
    - United Kingdom > England
      - Oxfordshire > Oxford (0.04)
  - North America > United States
    - Massachusetts > Norfolk County
      - Wellesley (0.04)
    - Michigan (0.04)
- Genre:
  - Research Report
    - Experimental Study (1.00)
    - New Finding (0.67)
- Industry:
  - Information Technology (0.69)
- Technology: