What Makes and Breaks Safety Fine-tuning? A Mechanistic Study
Neural Information Processing Systems
Safety fine-tuning helps align Large Language Models (LLMs) with human preferences for their safe deployment. To better understand the underlying factors that make models safe via safety fine-tuning, we design a synthetic data generation framework that captures salient aspects of an unsafe input by modeling the interaction between the task the model is asked to perform (e.g., "design") and the concepts on which that task is to be performed (e.g., a "cycle" vs. a "bomb").
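As a rough illustration of the task/concept split described in the abstract, the sketch below pairs a task with each concept and labels the pair. It is a toy under stated assumptions, not the paper's framework: the vocabularies, the `UNSAFE_PAIRS` set, and the `make_examples` helper are all hypothetical names introduced here.

```python
from itertools import product

# Hypothetical vocabularies, taken from the abstract's own example.
TASKS = ["design"]
CONCEPTS = ["cycle", "bomb"]

# Assumption: safety is a property of the (task, concept) pair,
# not of the task or the concept alone.
UNSAFE_PAIRS = {("design", "bomb")}


def make_examples():
    """Build one labeled prompt per (task, concept) combination."""
    examples = []
    for task, concept in product(TASKS, CONCEPTS):
        examples.append({
            "prompt": f"{task} a {concept}",
            "task": task,
            "concept": concept,
            "unsafe": (task, concept) in UNSAFE_PAIRS,
        })
    return examples


if __name__ == "__main__":
    for example in make_examples():
        print(example)
```

Keying the label on the pair keeps the task fixed while only the concept changes, which is the contrast the abstract draws between "cycle" and "bomb".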
Feb-17-2026, 07:39:15 GMT
- Country:
- North America > United States
- Michigan (0.04)
- Massachusetts > Norfolk County
- Wellesley (0.04)
- Europe
- United Kingdom > England
- Oxfordshire > Oxford (0.04)
- Latvia > Lubāna Municipality
- Lubāna (0.04)
- Denmark > Capital Region
- Copenhagen (0.04)
- Genre:
- Research Report
- Experimental Study (1.00)
- New Finding (0.67)
- Industry:
- Information Technology (0.69)
- Technology: