Intrinsic Self-Correction in LLMs: Towards Explainable Prompting via Mechanistic Interpretability
Lee, Yu-Ting, Chang, Fu-Chieh, Shih, Hui-Ying, Wu, Pei-Yuan
arXiv.org Artificial Intelligence
Despite its empirical success, the mechanism of intrinsic self-correction remains unclear. Prior work has attributed it to reduced model uncertainty and argued that performance gains stem from activating task-relevant latent concepts, as shown by probing [3]. Complementarily, Liu et al. [4] probe morality in attention and MLP activations, contending that intrinsic moral self-correction may merely exploit a shortcut to produce more moral outputs. Along a related axis, Li et al. [8] identify model confidence as a crucial factor for intrinsic self-correction, and argue that ignoring it can cause over-criticism and unreliable assessments of self-correction efficacy. Theoretically, Wang et al. [9] view self-correction through in-context learning: self-examinations act as reward signals that let LLMs iteratively refine responses without parameter updates. What is missing is a mechanistic analysis of how self-correction prompts steer a model's internal representations. Specifically, existing work reveals only what is encoded in activations, not how prompting causally displaces representations during generation. We directly analyze the displacement in representation space induced by prompting, leading us to ask: Does intrinsic self-correction prompting act as representation steering along interpretable latent directions? We approach this research question via mechanistic interpretability, with a methodology consisting of the following steps: (1) We define a prompt-induced shift from a self-correction prompt as the round-wise difference in activations.
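The round-wise difference in step (1) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each self-correction round has already been summarized as a single activation vector (for example, a pooled hidden state at one layer), and the function names `prompt_induced_shifts` and `alignment_with_direction`, as well as the candidate latent direction, are hypothetical and chosen only for illustration.

```python
import numpy as np

def prompt_induced_shifts(round_activations):
    """Return the difference between consecutive rounds' activation vectors.

    round_activations: array-like of shape (num_rounds, hidden_dim).
    Assumption: one vector per round; how it is extracted is not specified here.
    """
    acts = np.asarray(round_activations, dtype=float)
    if acts.ndim != 2 or acts.shape[0] < 2:
        raise ValueError("need a 2-D array with at least two rounds")
    # Row t of the result is acts[t + 1] - acts[t].
    return np.diff(acts, axis=0)

def alignment_with_direction(shifts, direction):
    """Cosine similarity between each shift and a candidate latent direction.

    Shifts (or a direction) with zero norm get a similarity of 0.0.
    """
    shifts = np.asarray(shifts, dtype=float)
    direction = np.asarray(direction, dtype=float)
    denom = np.linalg.norm(shifts, axis=1) * np.linalg.norm(direction)
    return np.divide(shifts @ direction, denom,
                     out=np.zeros_like(denom), where=denom > 0)

# Toy example with three rounds in a 2-dimensional space (made-up numbers).
toy_activations = [[0.0, 0.0], [1.0, 0.0], [1.0, 2.0]]
toy_shifts = prompt_induced_shifts(toy_activations)
toy_alignment = alignment_with_direction(toy_shifts, [1.0, 0.0])
```

In this toy case the first shift lies entirely along the chosen direction and the second is orthogonal to it, which is the kind of contrast a steering analysis would look for; real activations would come from a model's hidden states rather than hand-written vectors.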
Dec-12-2025