UpSafe$^\circ$C: Upcycling for Controllable Safety in Large Language Models

Sun, Yuhao, Xu, Zhuoer, Cui, Shiwen, Yang, Kun, Yu, Lingyun, Zhang, Yongdong, Xie, Hongtao

Oct-3-2025–arXiv.org Artificial Intelligence

Large Language Models (LLMs) have achieved remarkable progress across a wide range of tasks, but remain vulnerable to safety risks such as harmful content generation and jailbreak attacks. Existing safety techniques -- including external guardrails, inference-time guidance, and post-training alignment -- each face limitations in balancing safety, utility, and controllability. In this work, we propose UpSafe$^\circ$C, a unified framework for enhancing LLM safety through safety-aware upcycling. Our approach first identifies safety-critical layers and upcycles them into a sparse Mixture-of-Experts (MoE) structure, where the router acts as a soft guardrail that selectively activates original MLPs and added safety experts. We further introduce a two-stage SFT strategy to strengthen safety discrimination while preserving general capabilities. To enable flexible control at inference time, we introduce a safety temperature mechanism, allowing dynamic adjustment of the trade-off between safety and utility. Experiments across multiple benchmarks, base model, and model scales demonstrate that UpSafe$^\circ$C achieves robust safety improvements against harmful and jailbreak inputs, while maintaining competitive performance on general tasks. Moreover, analysis shows that safety temperature provides fine-grained inference-time control that achieves the Pareto-optimal frontier between utility and safety. Our results highlight a new direction for LLM safety: moving from static alignment toward dynamic, modular, and inference-aware control.

arxiv preprint arxiv, large language model, machine learning, (18 more...)

arXiv.org Artificial Intelligence

Oct-3-2025

arXiv.org PDF

Add feedback

Genre:
- Research Report > New Finding (1.00)

Industry:
- Law (1.00)
- Information Technology > Security & Privacy (1.00)
- Energy > Power Industry (1.00)
- Health & Medicine (0.92)
- Law Enforcement & Public Safety > Crime Prevention & Enforcement (0.90)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (0.95)