CLASS: Contrastive Learning via Action Sequence Supervision for Robot Manipulation
Lee, Sung-Wook; Kang, Xuhui; Yang, Brandon; Kuo, Yen-Ling
–arXiv.org Artificial Intelligence
Behavior Cloning (BC) has demonstrated strong performance in robotic manipulation by leveraging expressive models and action sequence modeling. Efforts to improve BC have focused on large-scale dataset collection [1, 2] and advances in model architectures [3, 4, 5] to better capture the complex distribution of demonstration data. However, expressive policies often struggle to generalize, especially when trained on demonstrations collected under heterogeneous conditions, that is, conditions in which the policy must adapt to properties absent from homogeneous data, such as changes in viewpoint or object appearance [6, 7]. This suggests a tendency to overfit to individual actions and a limited ability to capture structure shared across demonstrations [8]. To address this, we propose Contrastive Learning via Action Sequence Supervision (CLASS), a framework for learning behaviorally grounded representations from demonstrations using supervised contrastive learning. Rather than relying on direct action prediction, CLASS supervises the encoder by aligning observations according to the similarity of their action sequences, measured with Dynamic Time Warping (DTW), which encourages states that lead to similar future behaviors to cluster in the latent space. This weak supervision enables the model to capture structure shared across demonstrations, improving robustness to variations in visual conditions such as camera pose and object appearance. The learned representation supports both retrieval-based inference and policy fine-tuning, and it consistently improves performance in both homogeneous and heterogeneous data settings. Across a range of simulated and real-world robotic manipulation tasks, CLASS achieves strong gains over behavior cloning and representation learning baselines, demonstrating its ability to learn more transferable and composable behavioral representations.
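The core idea, pairing observations whose future action sequences are close under DTW and pulling their embeddings together with a supervised contrastive objective, can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation. The abstract does not say how DTW distances become positive pairs, so the batch-quantile threshold in `positive_mask` is a hypothetical rule; `supcon_loss` follows the standard supervised contrastive (SupCon) form with a pair mask in place of class labels; and `retrieve_action`, which averages the action chunks of the `k` nearest stored embeddings, is one assumed way to realize retrieval-based inference. All function names and the parameters `quantile`, `temperature`, and `k` are illustrative.

```python
import numpy as np


def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Classic O(len(a) * len(b)) DTW between two action sequences of shape
    (T, action_dim), using the Euclidean distance between single actions as the
    local cost."""
    ta, tb = len(a), len(b)
    cost = np.full((ta + 1, tb + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, ta + 1):
        for j in range(1, tb + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible warping steps.
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[ta, tb])


def positive_mask(action_chunks, quantile: float = 0.1) -> np.ndarray:
    """Hypothetical pairing rule: samples i and j are positives when the DTW
    distance between their future action chunks falls at or below the given
    quantile of all off-diagonal pairwise distances in the batch."""
    n = len(action_chunks)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = dtw_distance(action_chunks[i], action_chunks[j])
    threshold = np.quantile(dist[~np.eye(n, dtype=bool)], quantile)
    mask = dist <= threshold
    np.fill_diagonal(mask, False)  # an anchor is never its own positive
    return mask


def supcon_loss(z: np.ndarray, mask: np.ndarray, temperature: float = 0.1) -> float:
    """Standard supervised contrastive loss over embeddings z of shape (n, d),
    with positives given by a boolean (n, n) mask instead of class labels."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # use cosine similarity
    logits = z @ z.T / temperature
    np.fill_diagonal(logits, -np.inf)  # exclude self-comparisons
    row_max = logits.max(axis=1, keepdims=True)
    # Numerically stable log-sum-exp over all non-self candidates per anchor.
    log_denom = row_max + np.log(np.exp(logits - row_max).sum(axis=1, keepdims=True))
    log_prob = logits - log_denom
    pos_counts = mask.sum(axis=1)
    valid = pos_counts > 0  # skip anchors that have no positives in the batch
    pos_log_prob = np.where(mask, log_prob, 0.0).sum(axis=1)
    return float(-(pos_log_prob[valid] / pos_counts[valid]).mean())


def retrieve_action(query_z: np.ndarray, bank_z: np.ndarray,
                    bank_actions: np.ndarray, k: int = 3) -> np.ndarray:
    """Hypothetical retrieval-based inference: average the action chunks of the
    k stored demonstrations whose embeddings are most cosine-similar to the query."""
    q = query_z / np.linalg.norm(query_z)
    b = bank_z / np.linalg.norm(bank_z, axis=1, keepdims=True)
    top = np.argsort(-(b @ q))[:k]
    return bank_actions[top].mean(axis=0)
```

Because DTW aligns sequences that unfold at different speeds, two demonstrations that perform the same motion with different timing still land close together, which is what lets the contrastive objective group states by future behavior rather than by exact per-step actions. The pairwise DTW loop costs O(n² T²) per batch, which is acceptable for a sketch; a soft weighting of positives by DTW similarity would be an alternative to the hard threshold assumed here.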
Aug-5-2025
- Country:
  - North America > United States > Virginia (0.04)
- Genre:
  - Workflow (1.00)
- Technology:
  - Information Technology > Artificial Intelligence
    - Machine Learning (1.00)
    - Robots > Manipulation (0.50)