Safety Tax: Safety Alignment Makes Your Large Reasoning Models Less Reasonable

Huang, Tiansheng, Hu, Sihao, Ilhan, Fatih, Tekin, Selim Furkan, Yahn, Zachary, Xu, Yichang, Liu, Ling

Mar-1-2025–arXiv.org Artificial Intelligence

Safety alignment is an important procedure before the official deployment of a Large Language Model (LLM). While safety alignment has been extensively studied for LLM, there is still a large research gap for Large Reasoning Models (LRMs) that equip with improved reasoning capability. We in this paper systematically examine a simplified pipeline for producing safety aligned LRMs. With our evaluation of various LRMs, we deliver two main findings: i) Safety alignment can be done upon the LRM to restore its safety capability. ii) Safety alignment leads to a degradation of the reasoning capability of LRMs. The two findings show that there exists a trade-off between reasoning and safety capability with the sequential LRM production pipeline. The discovered trade-off, which we name Safety Tax, should shed light on future endeavors of safety research on LRMs. As a by-product, we curate a dataset called DirectRefusal, which might serve as an alternative dataset for safety alignment. Our source code is available at https://github.com/git-disl/Safety-Tax.

large language model, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

Mar-1-2025

arXiv.org PDF

Add feedback

Country:
- North America > United States (0.14)

Genre:
- Research Report > New Finding (1.00)

Industry:
- Information Technology > Security & Privacy (0.69)

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning (1.00)
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (0.49)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found