Trajectory Bellman Residual Minimization: ASimple Value-Based Method for LLMReasoning

Jun-21-2026, 12:50:32 GMT–Neural Information Processing Systems

Policy-based methods currently dominate reinforcement learning (RL) pipelines for large language model (LLM) reasoning, leaving value-based approaches largely unexplored. We revisit the classical paradigm of Bellman Residual Minimization and introduce Trajectory Bellman Residual Minimization (TBRM), an algorithm that naturally adapts this idea to LLMs, yielding a simple yet effective off-policy algorithm that optimizes a single trajectory-level Bellman objective using the model's own logits as Q-values. TBRM removes the need for critics, importancesampling ratios, or clipping, and can operate with only one rollout per prompt. We prove convergence to the near-optimal KL-regularized policy from arbitrary offpolicy data via an improved change-of-trajectory-measure analysis. Experiments on standard mathematical-reasoning benchmarks show that TBRM matches or surpasses policy-based baselines, like PPO and GRPO, with comparable or lower computational and memory overhead. Our results indicate that value-based RL might be a principled and efficient alternative for enhancing reasoning capabilities in LLMs.

large language model, machine learning, reinforcement learning, (16 more...)

Neural Information Processing Systems

Jun-21-2026, 12:50:32 GMT

Conferences PDF

Add feedback

Country:
- North America > United States (0.28)

Genre:
- Research Report
  - New Finding (1.00)
  - Experimental Study (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning
    - Reinforcement Learning (1.00)
    - Neural Networks > Deep Learning (0.68)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found