When Thinking Backfires: Mechanistic Insights Into Reasoning-Induced Misalignment

Yan, Hanqi, Xu, Hainiu, Qi, Siya, Yang, Shu, He, Yulan

Oct-14-2025–arXiv.org Artificial Intelligence

With the growing accessibility and wide adoption of large language models, concerns about their safety and alignment with human values have become paramount. In this paper, we identify a concerning phenomenon: Reasoning-Induced Misalignment (RIM), in which misalignment emerges when reasoning capabilities are strengthened, particularly when specific types of reasoning patterns are introduced during inference or training. Beyond reporting this vulnerability, we provide the first mechanistic account of its origins. Through representation analysis, we discover that specific attention heads facilitate refusal by reducing their attention to CoT tokens, a mechanism that modulates the model's rationalization process during inference. During training, we find significantly higher activation entanglement between reasoning and safety in safety-critical neurons than in control neurons, particularly after fine-tuning with those identified reasoning patterns. This entanglement strongly correlates with catastrophic forgetting, providing a neuron-level explanation for RIM.
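The abstract does not say how attention to CoT tokens is quantified. As a rough illustration only, one plausible measure is the share of a head's attention mass that the final query position places on the CoT span. The function name `cot_attention_share`, the array layout, and the choice of the last token as the query are all assumptions made for this sketch, not details taken from the paper.

```python
import numpy as np

def cot_attention_share(attn, cot_mask, query_index=-1):
    """Per-head share of attention that one query position puts on CoT tokens.

    attn:      array of shape (heads, seq, seq); each row is assumed to be a
               softmax-normalized attention distribution over key positions.
    cot_mask:  boolean array of shape (seq,), True where the key token belongs
               to the chain-of-thought span (hypothetical labeling).
    """
    attn = np.asarray(attn, dtype=float)
    cot_mask = np.asarray(cot_mask, dtype=bool)
    # Attention distribution of the chosen query position, one row per head.
    row = attn[:, query_index, :]
    # Sum the attention weights that fall on CoT key positions.
    return row[:, cot_mask].sum(axis=1)

# Toy input: 2 heads, 4 tokens, tokens 1 and 2 form the CoT span.
toy_attn = np.zeros((2, 4, 4))
toy_attn[0, -1] = [0.1, 0.6, 0.2, 0.1]  # head 0 attends mostly to the CoT
toy_attn[1, -1] = [0.7, 0.1, 0.1, 0.1]  # head 1 attends mostly elsewhere
toy_mask = np.array([False, True, True, False])
toy_shares = cot_attention_share(toy_attn, toy_mask)
```

Under this measure, a head that "reduces its attention to CoT tokens" would show a lower share on prompts the model refuses than on prompts it answers; whether the paper uses a comparable statistic is not stated in the abstract.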

large language model, machine learning, natural language

Country:
- Europe
  - Austria > Vienna (0.14)
  - Portugal > Lisbon > Lisbon (0.04)
- North America > United States
  - Florida > Miami-Dade County > Miami (0.04)

Genre:
- Research Report (1.00)

Industry:
- Education (0.67)
- Law > Criminal Law (0.46)
- Law Enforcement & Public Safety > Crime Prevention & Enforcement (0.67)

Technology:
- Information Technology > Artificial Intelligence
  - Cognitive Science > Problem Solving (0.88)
  - Machine Learning (1.00)
  - Natural Language > Large Language Model (1.00)
  - Representation & Reasoning (1.00)
