TeViR: Text-to-Video Reward with Diffusion Models for Efficient Reinforcement Learning
Yuhui Chen, Haoran Li, Zhennan Jiang, Haowei Wen, Dongbin Zhao
arXiv.org Artificial Intelligence
Developing scalable and generalizable reward engineering for reinforcement learning (RL) is crucial for creating general-purpose agents, especially in the challenging domain of robotic manipulation. While recent advances in reward engineering with Vision-Language Models (VLMs) have shown promise, their sparse reward nature significantly limits sample efficiency. This paper introduces TeViR, a novel method that leverages a pre-trained text-to-video diffusion model to generate dense rewards by comparing the predicted image sequence with current observations. Experimental results across 13 simulation and real-world robotic tasks demonstrate that TeViR outperforms traditional methods leveraging sparse rewards and other state-of-the-art (SOTA) methods, achieving better sample efficiency and performance without ground-truth environmental rewards. TeViR's ability to efficiently guide agents in complex environments highlights its potential to advance reinforcement learning applications in robotic manipulation.

Developing general-purpose agents with reinforcement learning (RL) necessitates scalable and generalizable reward engineering to provide effective task specifications for downstream policy learning [1]. Reward engineering is crucial because it determines the policies agents can learn and ensures they align with intended objectives. However, the manual design of reward functions often presents significant challenges [2]-[4], particularly in robotic manipulation tasks [5]-[8]. This challenge has emerged as a major bottleneck in developing general-purpose agents. Although inverse reinforcement learning (IRL) [9] learns rewards from pre-collected expert demonstrations, the learned reward functions are unreliable for policy learning due to noise and misspecification errors [10], especially in robotic manipulation tasks where in-domain data is limited [11]. Additionally, the learned reward functions are not generally applicable across tasks.
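To make the dense-reward idea concrete, the sketch below scores each environment observation against the frame sequence predicted by a text-to-video diffusion model, yielding a per-step reward rather than a single sparse success signal. This is a minimal pixel-space simplification under stated assumptions, not the authors' implementation: the function name `tevir_dense_reward`, the one-frame-per-timestep alignment, and the use of a raw pixel distance are all hypothetical; a practical system would likely compare learned visual embeddings instead of raw pixels.

```python
import numpy as np

def tevir_dense_reward(predicted_frames, observation, step):
    """Hypothetical sketch of a TeViR-style dense reward.

    predicted_frames: (T, H, W, C) array of frames generated by a
                      text-to-video diffusion model conditioned on the
                      task's text instruction (assumed precomputed).
    observation:      (H, W, C) current camera image from the environment.
    step:             current environment timestep, assumed roughly
                      aligned with the predicted frame index.
    """
    # Pick the predicted frame corresponding to the current timestep,
    # clamping to the last frame once the episode outlasts the video.
    target = predicted_frames[min(step, len(predicted_frames) - 1)]

    # Negative distance between the observation and the predicted frame
    # acts as a dense reward: the closer the rollout tracks the predicted
    # image sequence, the higher the reward. Normalizing by the number of
    # pixels keeps the scale independent of image resolution.
    diff = observation.astype(np.float32) - target.astype(np.float32)
    return -np.linalg.norm(diff) / observation.size
```

The design choice this illustrates is the contrast drawn in the abstract: instead of a VLM emitting a sparse success/failure signal at episode end, every timestep receives informative feedback, which is what drives the sample-efficiency gains the paper reports.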
Jun-25-2025
- Genre:
- Research Report
- New Finding (0.46)
- Promising Solution (0.34)
- Industry:
- Education (0.46)