What Makes and Breaks Safety Fine-tuning? A Mechanistic Study

Neural Information Processing Systems 

Safety fine-tuning helps align Large Language Models (LLMs) with human preferences for their safe deployment. To better understand the underlying factors that make models safe via safety fine-tuning, we design a synthetic data generation framework that captures salient aspects of an unsafe input by modeling the interaction between the task the model is asked to perform (e.g., "design") and the specific concepts the task is to be performed upon (e.g., a "cycle" vs. a "bomb").
