ExPO: Unlocking Hard Reasoning with Self-Explanation-Guided Reinforcement Learning

Jun-19-2026, 04:46:15 GMT–Neural Information Processing Systems

Self-improvement via RL often fails on complex reasoning tasks because GRPO-style post-training methods rely on the model's initial ability to generate positive samples. Without guided exploration, these approaches merely reinforce what the model already knows (distribution sharpening) rather than enabling it to solve problems for which it initially generates no correct solutions. To unlock reasoning ability in such settings, the model must explore new reasoning trajectories beyond its current output distribution. Such exploration requires access to sufficiently good positive samples to guide learning. While expert demonstrations seem like a natural solution, we find that they are often ineffective in RL post-training.
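The dependence on positive samples can be illustrated with a minimal sketch of group-relative advantage normalization as commonly described for GRPO. This is not the paper's ExPO method. The function name, the binary reward values, and the `eps` constant are assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group (GRPO-style).

    Hypothetical helper: `eps` is an assumed stabilizer that avoids
    division by zero when every reward in the group is identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hard problem: none of the sampled solutions is correct.
all_fail = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# Easier problem: one of the four sampled solutions is correct.
one_hit = group_relative_advantages([0.0, 0.0, 1.0, 0.0])

print(all_fail)  # every advantage is zero, so there is no learning signal
print(one_hit)   # the correct sample gets a positive advantage
```

When every sample in a group fails, all advantages are zero and the policy gradient for that problem vanishes, which is why some source of positive samples is needed on problems the model cannot yet solve.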

large language model, machine learning, reinforcement learning

Country:
- North America > United States (0.28)

Genre:
- Research Report
  - Experimental Study (1.00)
  - New Finding (0.67)

Industry:
- Education (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning (1.00)
  - Natural Language
    - Large Language Model (0.95)
    - Chatbot (0.69)
  - Machine Learning
    - Neural Networks > Deep Learning (0.69)
    - Reinforcement Learning (0.65)
