From … to …: Multidimensional Supervision of Reasoning Process for LLM Optimization
Beining Wang, Weihang Su, Hongtao Tian, Tao Yang, Yujia Zhou, Ting Yao, Qingyao Ai, Yiqun Liu
Improving the multi-step reasoning ability of Large Language Models (LLMs) is a critical yet challenging task. The dominant paradigm, outcome-supervised reinforcement learning with verifiable rewards (RLVR), rewards only correct final answers, which often propagates flawed reasoning and suffers from sparse reward signals. While process-level reward models (PRMs) provide denser, step-by-step feedback, they lack generalizability and interpretability and require task-specific segmentation of the reasoning process. To address these limitations, we propose the Dimension-level Reward Model (DRM), a new supervision framework that bridges the gap between these two approaches. DRM evaluates the quality of a reasoning process along three fundamental, complementary, and interpretable dimensions: Confidence for uncertainty calibration, Relevance for semantic alignment, and Coherence for logical consistency. Together, these dimensions capture aspects beyond final-answer correctness and enable interpretable assessment without requiring ground-truth answers. Experimental results show that DRM provides effective supervision signals, guides the optimization of LLMs, and enhances their reasoning ability. In particular, DRM-supervised training achieves consistent gains on both in-distribution and out-of-distribution open-domain tasks, including mathematics, question answering, code execution, and puzzles. Our findings demonstrate that multidimensional supervision of the reasoning process can improve the generalized reasoning ability of LLMs beyond the training distribution.

Enhancing the ability of Large Language Models (LLMs) to perform complex, multi-step reasoning remains a central challenge in their development (Zhang et al., 2025b; Xu et al., 2025). The dominant paradigm for this enhancement relies on Reinforcement Learning with Verifiable Rewards (RLVR) (Shao et al., 2024; Yang et al., 2024; Luo et al., 2024). RLVR provides supervision at the outcome level, assigning a positive reward only if the final answer is correct. However, this reward mechanism has fundamental limitations. First, answer supervision overlooks the quality of the reasoning process (Yu et al., 2025a).
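The abstract does not specify how DRM scores each dimension or aggregates them. The Python sketch below illustrates, under loose assumptions, how three interpretable dimension scores (Confidence, Relevance, Coherence) could be combined into a single scalar process reward that requires no ground-truth answer; such a scalar could then replace the binary answer reward in an RLVR-style policy-gradient loop. All function names, scoring heuristics, and the equal weighting are hypothetical, not the authors' implementation.

```python
import math

# Illustrative sketch of a dimension-level process reward in the spirit of DRM.
# The scoring functions and weights below are assumptions for exposition only.

def confidence_score(token_logprobs):
    """Uncertainty calibration: mean token probability over the reasoning trace."""
    return sum(math.exp(lp) for lp in token_logprobs) / len(token_logprobs)

def relevance_score(question_emb, step_embs):
    """Semantic alignment: mean cosine similarity between the question and each step."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)
    return sum(cos(question_emb, s) for s in step_embs) / len(step_embs)

def coherence_score(entailment_probs):
    """Logical consistency: mean probability that each step follows from the previous one
    (e.g., from an off-the-shelf NLI model)."""
    return sum(entailment_probs) / len(entailment_probs)

def drm_reward(token_logprobs, question_emb, step_embs, entailment_probs,
               weights=(1/3, 1/3, 1/3)):
    """Aggregate the three dimensions into one scalar reward; no gold answer is needed."""
    c = confidence_score(token_logprobs)
    r = relevance_score(question_emb, step_embs)
    h = coherence_score(entailment_probs)
    w_c, w_r, w_h = weights
    return w_c * c + w_r * r + w_h * h

if __name__ == "__main__":
    # Toy example with hand-made values; embeddings would normally come from a
    # sentence encoder and entailment probabilities from an NLI model.
    reward = drm_reward(
        token_logprobs=[-0.1, -0.3, -0.2],
        question_emb=[0.6, 0.8],
        step_embs=[[0.5, 0.9], [0.7, 0.7]],
        entailment_probs=[0.9, 0.8],
    )
    print(f"DRM reward: {reward:.3f}")
```

Because each dimension is scored independently and bounded, the aggregate stays interpretable: a low reward can be traced back to poor calibration, drift from the question, or a logical break, rather than to an opaque scalar.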
arXiv.org Artificial Intelligence
Oct-14-2025