TDRM: Smooth Reward Models with Temporal Difference for LLM RL and Inference
Dan Zhang, Min Cai, Jonathan Light, Ziniu Hu, Yisong Yue, Jie Tang
Reward models are central to both reinforcement learning (RL) with language models and inference-time verification. However, existing reward models often lack temporal consistency, leading to ineffective policy updates and unstable RL training. We introduce TDRM, a method for learning smoother and more reliable reward models by minimizing temporal differences (TD), for use in training-time reinforcement learning and inference-time verification. Experiments show that TD-trained process reward models (PRMs) improve performance across Best-of-N (up to 6.6%) and tree-search (up to 23.7%) settings. When combined with Reinforcement Learning with Verifiable Rewards (RLVR), TD-trained PRMs lead to more data-efficient RL -- achieving with just 2.5k examples performance comparable to what baseline methods require 50.1k examples to attain -- and yield higher-quality language model policies across 8 model variants (5 series), namely Qwen2.5-(0.5B, 1.5B), GLM4-9B-0414, GLM-Z1-9B-0414, Qwen2.5-Math-(1.5B, 7B), and DeepSeek-R1-Distill-Qwen-(1.5B, 7B). We release all code at https://github.com/THUDM/TDRM.
arXiv.org Artificial Intelligence
Sep-30-2025
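The abstract does not spell out the TD objective itself, so the following is a minimal illustrative sketch (not the authors' implementation) of how a TD(λ)-style consistency loss could be applied to a process reward model's per-step scores, with only the final step grounded by a verifiable outcome. The function names, hyperparameters (`gamma`, `lam`), and PyTorch setup are assumptions for illustration.

```python
# Illustrative sketch only: TD(lambda)-style smoothing of per-step PRM scores.
import torch

def td_lambda_targets(step_values: torch.Tensor,
                      final_reward: float,
                      gamma: float = 1.0,
                      lam: float = 0.95) -> torch.Tensor:
    """TD(lambda) targets for per-step PRM scores; intermediate steps receive no
    direct reward, so credit flows back from the verifiable final outcome."""
    with torch.no_grad():
        T = step_values.shape[0]
        targets = torch.empty_like(step_values)
        targets[-1] = final_reward  # last step grounded by the verifier
        for t in reversed(range(T - 1)):
            # Blend the one-step bootstrap V(s_{t+1}) with the longer-horizon lambda-return.
            targets[t] = gamma * ((1 - lam) * step_values[t + 1] + lam * targets[t + 1])
    return targets

def td_loss(step_values: torch.Tensor, final_reward: float) -> torch.Tensor:
    """Mean-squared TD error: penalizes jumps between adjacent step scores
    that are not explained by the eventual outcome."""
    targets = td_lambda_targets(step_values, final_reward)
    return torch.mean((step_values - targets) ** 2)

# Toy usage: a five-step solution judged correct (reward 1.0) by the verifier.
step_values = torch.tensor([0.2, 0.4, 0.3, 0.7, 0.9], requires_grad=True)
loss = td_loss(step_values, final_reward=1.0)
loss.backward()
print(float(loss), step_values.grad)
```

With `lam` near 1 this reduces to regressing every step toward the final outcome (a Monte Carlo target); smaller `lam` leans more on bootstrapping from the next step's score, which is what encourages temporally smooth, consistent step-level rewards.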