Refusal Falls off a Cliff: How Safety Alignment Fails in Reasoning?

Yin, Qingyu, Leong, Chak Tou, Yang, Linyi, Huang, Wenxuan, Li, Wenjie, Wang, Xiting, Yoon, Jaehong, YunXing, XingYu, Gu, Jinjin

Oct-8-2025–arXiv.org Artificial Intelligence

Large reasoning models (LRMs) with multi-step reasoning capabilities have shown remarkable problem-solving abilities, yet they exhibit concerning safety vulnerabilities that remain poorly understood. In this work, we investigate why safety alignment fails in reasoning models through a mechanistic interpretability lens. Using a linear probing approach to trace refusal intentions across token positions, we discover a striking phenomenon that we term the refusal cliff: many poorly aligned reasoning models correctly identify harmful prompts and maintain strong refusal intentions during their thinking process, but experience a sharp drop in refusal scores at the final tokens before output generation. This suggests that these models are not inherently unsafe; rather, their refusal intentions are systematically suppressed. Through causal intervention analysis, we identify a sparse set of attention heads that contribute negatively to refusal behavior. Ablating just 3% of these heads can reduce attack success rates below 10%. Building on these mechanistic insights, we propose Cliff-as-a-Judge, a novel data selection method that identifies the training examples exhibiting the largest refusal cliff in order to efficiently repair reasoning models' safety alignment. This approach achieves comparable safety improvements using only 1.7% of the vanilla safety training data, demonstrating a less-is-more effect in safety alignment. Code is available here.

Large Reasoning Models (Guo et al., 2025; Shao et al., 2024; Hugging Face, 2025), with advanced reasoning capability derived from reinforcement learning with verifiable rewards (RLVR) (Yu et al., 2025; Liu et al., 2025a), are designed to handle complex problem solving, logical inference, and tool-assisted planning.
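As a rough illustration of the idea in the abstract, the sketch below scores each token position with a logistic linear probe, measures how far the score falls at the final positions, and keeps the examples with the largest drop. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the `tail` parameter, and the definition of the cliff as "mean score before the tail minus mean score over the tail" are all hypothetical, since the abstract does not give the exact formula.

```python
import math
from typing import Dict, List, Sequence


def probe_score(hidden: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Logistic linear probe: map one hidden state to a refusal score in (0, 1)."""
    logit = sum(h * w for h, w in zip(hidden, weights)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def refusal_cliff(scores: Sequence[float], tail: int = 2) -> float:
    """Assumed cliff size: mean score before the last `tail` positions
    minus the mean score over those last `tail` positions."""
    if len(scores) <= tail:
        raise ValueError("need more positions than the tail length")
    body, end = scores[:-tail], scores[-tail:]
    return sum(body) / len(body) - sum(end) / len(end)


def select_by_cliff(traces: Dict[str, Sequence[float]], fraction: float, tail: int = 2) -> List[str]:
    """Return the ids of the `fraction` of examples with the largest assumed cliff."""
    ranked = sorted(traces, key=lambda k: refusal_cliff(traces[k], tail), reverse=True)
    keep = max(1, round(len(ranked) * fraction))
    return ranked[:keep]


# Toy per-position refusal scores (made-up numbers, for illustration only).
traces = {
    "a": [0.9, 0.9, 0.9, 0.2, 0.1],  # high during thinking, sharp drop at the end
    "b": [0.8, 0.8, 0.8, 0.8, 0.7],  # stays high: almost no cliff
    "c": [0.1, 0.1, 0.1, 0.1, 0.1],  # never refuses: no cliff
}
selected = select_by_cliff(traces, fraction=0.34)
```

In this toy setup only example "a" is selected, because it is the only trace whose refusal score is high during the thinking positions and collapses at the end. In practice the per-position scores would come from applying a trained probe to the model's hidden states rather than from hand-written lists.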

Country:
- Asia
  - China
    - Hong Kong (0.04)
    - Shanghai > Shanghai (0.04)
  - Indonesia > Bali (0.04)
  - Middle East > UAE
    - Abu Dhabi Emirate > Abu Dhabi (0.14)
  - Myanmar > Tanintharyi Region
    - Dawei (0.04)
  - Singapore (0.04)
  - Thailand > Bangkok
    - Bangkok (0.04)
- Europe > Austria
  - Vienna (0.14)
- North America > Dominican Republic (0.04)

Genre:
- Research Report > New Finding (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Cognitive Science > Problem Solving (0.86)
  - Machine Learning > Neural Networks
    - Deep Learning (0.94)
  - Natural Language
    - Chatbot (0.68)
    - Large Language Model (1.00)
  - Representation & Reasoning (1.00)