Beware of Your Po! Measuring and Mitigating AI Safety Risks in Role-Play Fine-Tuning of LLMs
Weixiang Zhao, Yulin Hu, Yang Deng, Jiahe Guo, Xingyu Sui, Xinyang Han, An Zhang, Yanyan Zhao, Bing Qin, Tat-Seng Chua, Ting Liu
arXiv.org Artificial Intelligence
Role-playing enables large language models (LLMs) to engage users in immersive and personalized interactions, but it also introduces significant safety risks. Existing role-play fine-tuning techniques improve role adaptability but may degrade safety performance, particularly for villainous characters. In this work, we conduct the first comprehensive assessment of role-play fine-tuning risks by training 95 role-specific LLMs using RoleBench. Our experiments reveal that role-play fine-tuning leads to a noticeable decline in safety performance, with safety risks varying based on character traits. To tackle this challenge, we propose Safety-Aware Role-Play Fine-Tuning (SaRFT), a novel method designed to balance role-playing capabilities and safety. Extensive experiments on LLaMA-3-8B-Instruct, Gemma-2-9B-it, and Qwen2.5-7B-Instruct demonstrate that SaRFT consistently outperforms state-of-the-art baselines under both LoRA and full-parameter fine-tuning settings. Our findings highlight the necessity of role-adaptive safety measures and provide insights into mitigating role-specific safety risks in role-playing LLMs.
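The abstract contrasts LoRA and full-parameter fine-tuning settings for the role-specific models. Below is a minimal, hypothetical sketch of what the LoRA setting looks like in practice: adapting one of the instruction-tuned backbones named above to a single character with low-rank adapters. It is not the paper's SaRFT code; the adapter ranks, target modules, hyperparameters, and the toy dialogue example are illustrative assumptions.

```python
# Minimal LoRA role-play fine-tuning sketch (assumed setup, not the paper's SaRFT method).
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)

# One of the backbones mentioned in the abstract; any causal LM works the same way.
base = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# Low-rank adapters on the attention projections; rank/alpha/targets are assumptions.
lora_cfg = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                      target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
                      task_type="CAUSAL_LM")
model = get_peft_model(model, lora_cfg)

# Toy stand-in for RoleBench-style role-play dialogues (user turn -> in-character reply).
examples = [
    {"text": "User: Who are you?\nCharacter: I am the villain of this tale."},
]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

dataset = Dataset.from_list(examples).map(tokenize, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="role-lora",
                           per_device_train_batch_size=1,
                           num_train_epochs=1,
                           learning_rate=2e-4,
                           logging_steps=1),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

Because only the adapter weights are updated, this setting touches far fewer parameters than full-parameter fine-tuning; the paper evaluates its method under both regimes.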
Feb-28-2025
- Country:
- Asia > Thailand (0.14)
- North America > United States (0.14)
- Genre:
- Research Report > New Finding (0.87)
- Industry:
- Information Technology (0.67)
- Leisure & Entertainment (0.46)
- Materials (0.46)