ReFIne: A Framework for Trustworthy Large Reasoning Models with Reliability, Faithfulness, and Interpretability

Chung-En Sun, Ge Yan, Akshay Kulkarni, Tsui-Wei Weng

arXiv.org Artificial Intelligence 

Recent advances in long chain-of-thought (CoT) reasoning have largely prioritized answer accuracy and token efficiency, while overlooking aspects critical to trustworthiness. We argue that usable reasoning systems must be trustworthy, characterized by three properties: interpretability, faithfulness, and reliability. To this end, we propose ReFIne, a new training framework that integrates supervised fine-tuning with GRPO to encourage models to: (i) improve interpretability by producing structured, tag-based traces with high-level planning that are easier for humans to follow; (ii) enhance faithfulness by explicitly disclosing the decisive information guiding each solution, with consistent cross-section references; and (iii) promote reliability by providing self-assessments of both the derivation's soundness and the confidence of the final answer. We apply ReFIne to the Qwen3 models at multiple scales (1.7B/4B/8B) and evaluate across mathematical benchmarks of varying difficulty. Our experimental results show that ReFIne models generate clearer and better-structured reasoning traces (interpretability +44.0%), more faithfully expose their underlying decision process (faithfulness +18.8%), and offer informative confidence estimates (reliability +42.4%). These findings highlight an overlooked but important direction: reasoning models should be optimized not only for accuracy, but also for broader dimensions of trustworthiness.

Large Language Models (LLMs) trained with reinforcement learning (RL) to produce extended Chain-of-Thought (CoT) traces have achieved strong performance on complex tasks such as math problem solving. These models are often referred to as Large Reasoning Models (LRMs) (Guo et al., 2025; Jaech et al., 2024).
Recent progress on LRMs has largely targeted efficiency and accuracy, e.g., inference-time strategies and fine-tuning methods to shorten the reasoning length or boost accuracy (Sui et al., 2025; Muennighoff et al., 2025; Hao et al., 2024; Luo et al., 2025). However, this line of work typically treats CoT as a means to improve task performance rather than as a communication medium for users to audit and understand model behavior.
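
To make the idea of a structured, tag-based trace with a self-reported confidence more concrete, the following is a minimal Python sketch of how such a trace could be split into sections and checked for completeness. The tag names (plan, reasoning, self_assessment, answer, confidence) and the helper functions are hypothetical and chosen only for illustration; the text above does not specify the tag vocabulary ReFIne actually uses.

```python
import re

# Hypothetical tag names, chosen only for illustration; the source text does
# not state which tags the framework actually emits.
SECTION_TAGS = ("plan", "reasoning", "self_assessment", "answer", "confidence")


def parse_trace(trace: str) -> dict:
    """Split a tagged trace into {tag: text}; tags that are absent are omitted."""
    sections = {}
    for tag in SECTION_TAGS:
        match = re.search(rf"<{tag}>(.*?)</{tag}>", trace, flags=re.DOTALL)
        if match:
            sections[tag] = match.group(1).strip()
    return sections


def missing_sections(sections: dict) -> list:
    """Return the expected tags that the parsed trace does not contain."""
    return [tag for tag in SECTION_TAGS if tag not in sections]


# A made-up trace for a toy problem (solve 2x + 3 = 11).
example = (
    "<plan>Isolate x, then check the result.</plan>"
    "<reasoning>2x + 3 = 11, so 2x = 8 and x = 4.</reasoning>"
    "<self_assessment>Each step follows from the previous one.</self_assessment>"
    "<answer>4</answer>"
    "<confidence>0.9</confidence>"
)
parsed = parse_trace(example)
```

A reader auditing a trace could call missing_sections(parsed) to confirm that every expected section is present before inspecting the plan, the self-assessment, and the stated confidence individually.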
