Right Question is Already Half the Answer: Fully Unsupervised LLMReasoning Incentivization

Jun-17-2026, 18:18:01 GMT–Neural Information Processing Systems

Existing methods to enhance the reasoning capability of large language models predominantly rely on supervised fine-tuning (SFT) followed by reinforcement learning (RL) on reasoning-specific data. These approaches critically depend on external supervisions-such as labeled reasoning traces, verified golden answers, or pre-trained reward models. In this work, we propose Entropy Minimized Policy Optimization (EMPO), which makes an early attempt at fully unsupervised LLM reasoning incentivization. By minimizing the semantic entropy of LLMs on unlabeled questions, EMPO achieves competitive performance compared to supervised counterparts. Specifically, without any external supervision, EMPO boosts the accuracy of Qwen2.5-Math-7BBase from 33.7% to 51.6% on math benchmarks and improves the accuracy of Qwen2.5-7BBase from 32.1% to 50.1% on MMLU-Pro. Primary analysis are also provided to interpret the effectiveness of EMPO.

large language model, machine learning, qwen2, (17 more...)

Neural Information Processing Systems

Jun-17-2026, 18:18:01 GMT

Conferences PDF

Add feedback

Country:
- Asia > China (0.28)

Genre:
- Research Report > Experimental Study (1.00)

Industry:
- Education (0.93)

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning (1.00)
  - Natural Language > Large Language Model (1.00)
  - Machine Learning (1.00)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found