TDRM: Smooth Reward Models with Temporal Difference for LLM RL and Inference
