Supervised Optimism Correction: Be Confident When LLMs Are Sure

Zhang, Junjie, Yang, Rushuai, Liu, Shunyu, Lin, Ting-En, Huang, Fei, Chen, Yi, Li, Yongbin, Tao, Dacheng

Jun-3-2025–arXiv.org Artificial Intelligence

In this work, we establish a novel theoretical connection between supervised fine-tuning and offline reinforcement learning under the token-level Markov decision process, revealing that large language models indeed learn an implicit $Q$-function for inference. Through this theoretical lens, we demonstrate that the widely used beam search method suffers from unacceptable over-optimism, where inference errors are inevitably amplified due to inflated $Q$-value estimations of suboptimal steps. To address this limitation, we propose Supervised Optimism Correction(SOC), which introduces a simple yet effective auxiliary loss for token-level $Q$-value estimations during supervised fine-tuning. Specifically, the auxiliary loss employs implicit value regularization to boost model confidence in expert-demonstrated responses, thereby suppressing over-optimism toward insufficiently supervised responses. Extensive experiments on mathematical reasoning benchmarks, including GSM8K, MATH, and GAOKAO, showcase the superiority of the proposed SOC with beam search across a series of open-source models.

arxiv preprint arxiv, large language model, machine learning, (16 more...)

arXiv.org Artificial Intelligence

Jun-3-2025

arXiv.org PDF

Add feedback

Country:
- Asia (0.68)
- North America > United States (0.28)

Genre:
- Research Report (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning > Search (1.00)
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Learning Graphical Models
    - Undirected Networks > Markov Models (0.34)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found