Incentivizing LLMs to Self-Verify Their Answers
Neural Information Processing Systems
Large Language Models (LLMs) have demonstrated remarkable progress in complex reasoning tasks through both post-training and test-time scaling laws. While prevalent test-time scaling approaches are often realized by using external reward models to guide the model generation process, we find that only marginal gains can be acquired when scaling a model post-trained on specific reasoning tasks. We identify that the limited improvement stems from the distribution discrepancy between the specific post-trained generator and the general reward model.
Jun-14-2026, 11:52:24 GMT
- Genre:
- Research Report
- Experimental Study (1.00)
- New Finding (0.92)
- Industry:
- Education (0.46)
- Technology: