Intrinsic Self-Correction in LLMs: Towards Explainable Prompting via Mechanistic Interpretability

Lee, Yu-Ting, Chang, Fu-Chieh, Shih, Hui-Ying, Wu, Pei-Yuan

Dec-12-2025–arXiv.org Artificial Intelligence

Despite its empirical success, the mechanism of intrinsic self-correction remains unclear. Prior work has attributed it to reduced model uncertainty and argued that performance gains stem from activating task-relevant latent concepts, as shown by probing [3]. Complementarily, Liu et al. [4] probe morality in attention and MLP activations, contending that intrinsic moral self-correction may merely exploit a shortcut to produce more moral outputs. Along a related axis, Li et al. [8] identify model confidence as a crucial factor for intrinsic self-correction, and argue that ignoring it can cause over-criticism and unreliable assessments of self-correction efficacy. Theoretically, Wang et al. [9] view self-correction through in-context learning: self-examinations act as reward signals that let LLMs iteratively refine responses without parameter updates. What is missing is a mechanistic analysis of how self-correction prompts steer a model's internal representations. Specifically, existing work reveals only what is encoded in activations, not how prompting causally displaces representations during generation. We directly analyze the displacement in representation space induced by prompting, leading us to ask: Does intrinsic self-correction prompting act as representation steering along interpretable latent directions? We approach this research question via mechanistic interpretability, with a methodology consisting of the following steps: (1) We define a prompt-induced shift from a self-correction prompt as the round-wise difference in activations.
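The idea of a prompt-induced shift as a round-wise difference in activations can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the excerpt does not say which layer or token position the activations come from, nor how alignment with a latent direction is measured. The function names, the toy vectors, and the use of cosine similarity are all hypothetical.

```python
import math


def prompt_induced_shift(acts_before, acts_after):
    """Round-wise difference in activations: next round minus previous round.

    Assumption: each argument is one activation vector (a list of floats)
    taken from the same layer and position in two consecutive rounds.
    """
    if len(acts_before) != len(acts_after):
        raise ValueError("activation vectors must have the same length")
    return [after - before for before, after in zip(acts_before, acts_after)]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is zero."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy 3-dimensional activations (made-up numbers, not from the paper).
h_round_0 = [0.5, 1.0, -0.5]   # before the self-correction prompt
h_round_1 = [1.5, 1.0, -0.5]   # after the self-correction prompt
direction = [1.0, 0.0, 0.0]    # hypothetical interpretable latent direction

shift = prompt_induced_shift(h_round_0, h_round_1)
alignment = cosine_similarity(shift, direction)
print(shift, alignment)
```

In this toy case the shift lies entirely along the chosen direction, so the alignment is maximal. With real model activations one would expect only partial alignment, and the choice of layer, position, and direction would all affect the result.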

large language model, machine learning, natural language

Country:
- Asia > Taiwan (0.16)

Genre:
- Research Report > New Finding (0.67)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning (1.00)
