Post-Training Large Language Models via Reinforcement Learning from Self-Feedback

van Niekerk, Carel, Vukovic, Renato, Ruppik, Benjamin Matthias, Lin, Hsien-chin, Gašić, Milica

Jul-30-2025–arXiv.org Artificial Intelligence

Large Language Models (LLMs) often produce plausible but poorly-calibrated answers, limiting their reliability on reasoning-intensive tasks. We present Reinforcement Learning from Self-Feedback (RLSF), a post-training stage that uses the model's own confidence as an intrinsic reward, mimicking how humans learn in the absence of external feedback. After a frozen LLM generates several chain-of-thought solutions, we define and compute the confidence of each final answer span and rank the traces accordingly. These synthetic preferences are then used to fine-tune the policy with standard preference optimization, similar to RLHF yet requiring no human labels, gold answers, or externally curated rewards. RLSF simultaneously (i) refines the model's probability estimates -- restoring well-behaved calibration -- and (ii) strengthens step-by-step reasoning, yielding improved performance on arithmetic reasoning and multiple-choice question answering. By turning a model's own uncertainty into useful self-feedback, RLSF affirms reinforcement learning on intrinsic model behaviour as a principled and data-efficient component of the LLM post-training pipeline and warrents further research in intrinsic rewards for LLM post-training.

large language model, machine learning, reinforcement learning, (16 more...)

arXiv.org Artificial Intelligence

Jul-30-2025

arXiv.org PDF

Add feedback

Country:
- Europe (0.93)
- Asia (0.68)
- North America > United States
  - Minnesota (0.28)

Genre:
- Research Report (0.64)

Industry:
- Education (0.35)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning
    - Reinforcement Learning (1.00)
    - Neural Networks > Deep Learning (1.00)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found