SARM: Stage-Aware Reward Modeling for Long Horizon Robot Manipulation
Qianzhong Chen, Justin Yu, Mac Schwager, Pieter Abbeel, Yide Shentu, Philipp Wu
Large-scale robot learning has recently shown promise in enabling robots to perform complex tasks by integrating perception, control, and optionally language understanding into a unified framework. However, such systems continue to struggle with long-horizon, contact-rich manipulation tasks, such as handling deformable objects, where supervision from demonstrations is often inconsistent in quality. In such settings, reward modeling offers a natural solution: by providing grounded progress signals, it can transform noisy demonstrations into stable supervision that generalizes across diverse trajectories. In this work, we introduce a stage-aware, video-based reward modeling framework that jointly predicts the high-level task stage and fine-grained progress within each stage. Reward labels are automatically derived from natural language subtask annotations, enabling consistent progress estimation across variable-length and heterogeneous demonstrations. This design overcomes the limitations of frame-index-based labeling, which collapses in long, variable-duration tasks such as folding a T-shirt. Our reward model demonstrates robustness to demonstration variability, generalization to out-of-distribution scenarios, and strong utility for downstream policy training. Building upon this reward model, we propose the Reward-Aligned Behavior Cloning (RA-BC) framework, which selectively filters high-quality data and reweights training samples according to reward estimates. Extensive experiments demonstrate that the reward model outperforms baselines on out-of-distribution real robot policy rollouts and on human demonstration validation. Our approach achieves 83% success on folding T-shirts from the flattened state and 67% from the crumpled state, dramatically surpassing vanilla behavior cloning, which attains only 8% and 0% success, respectively, on the same training dataset. Overall, our results highlight reward modeling as a key enabler for scalable, annotation-efficient, and robust imitation learning in long-horizon robotic manipulation.

The long-standing vision of enabling robots to seamlessly assist humans in household chores has inspired decades of research in robotics. From tidying living spaces to preparing meals, such capabilities hold the promise of freeing up human time and improving quality of life.
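The abstract describes two mechanisms: deriving per-frame progress labels from stage (subtask) annotations rather than raw frame indices, and filtering/reweighting behavior-cloning samples by estimated reward. The sketch below is a minimal illustration of that idea, not the authors' implementation; the annotation format, the (stage + progress) / num_stages reward, and the quantile-filter-plus-softmax weighting are all assumptions made for illustration.

```python
# Hypothetical sketch of stage-aware progress labels and reward-aligned
# sample weighting. Names and formulas are assumptions, not the paper's code.
import numpy as np

def stage_progress_labels(num_frames, stage_boundaries):
    """Label each frame with (stage index, within-stage progress in [0, 1]).

    stage_boundaries: list of (start_frame, end_frame) per annotated subtask,
    e.g. aligned from natural language subtask annotations. Because progress is
    normalized within each stage, demonstrations of different lengths receive
    consistent targets, unlike raw frame-index labels.
    """
    stages = np.zeros(num_frames, dtype=int)
    progress = np.zeros(num_frames, dtype=float)
    for k, (start, end) in enumerate(stage_boundaries):
        for t in range(start, end):
            stages[t] = k
            progress[t] = (t - start) / max(end - start - 1, 1)
    return stages, progress

def scalar_reward(stage, progress, num_stages):
    """Collapse (stage, within-stage progress) into a monotone task-progress reward."""
    return (stage + progress) / num_stages

def ra_bc_weights(rewards, keep_quantile=0.5, temperature=0.1):
    """Reward-aligned weighting: drop low-reward samples, softmax-weight the rest.

    rewards: per-sample reward estimates from the learned reward model.
    The exact filtering/weighting rule used by RA-BC may differ; this is one
    plausible instantiation of "filter high-quality data and reweight samples".
    """
    rewards = np.asarray(rewards, dtype=float)
    threshold = np.quantile(rewards, 1.0 - keep_quantile)
    keep = rewards >= threshold
    w = np.exp((rewards - rewards.max()) / temperature) * keep
    return w / (w.sum() + 1e-8)
```

In a training loop, the returned weights would simply scale each sample's behavior-cloning loss, so that high-reward (high-quality) demonstration segments dominate the gradient.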
arXiv.org Artificial Intelligence
Oct-31-2025