Bridging Perception and Action: Spatially-Grounded Mid-Level Representations for Robot Generalization
Yang, Jonathan, Fu, Chuyuan Kelly, Shah, Dhruv, Sadigh, Dorsa, Xia, Fei, Zhang, Tingnan
arXiv.org Artificial Intelligence
Figure 1: Bimanual, dexterous manipulation requires task-specific grounding. The left side depicts various axes for spatial grounding as well as qualitative categorizations of different mid-level representations. Different representations lead to different levels of improvement depending on the task.

Abstract: In this work, we investigate how spatially-grounded auxiliary representations can provide both broad, high-level grounding and direct, actionable information to improve policy learning performance and generalization for dexterous tasks. We study these mid-level representations across three critical dimensions: object-centricity, pose-awareness, and depth-awareness. We use these interpretable mid-level representations to train specialist encoders via supervised learning, then use the encoders' outputs as inputs to a diffusion policy to solve dexterous bimanual manipulation tasks in the real world. We propose a novel mixture-of-experts policy architecture that combines multiple specialized expert models, each trained on a distinct mid-level representation, to improve the generalization of the policy. On our evaluation tasks, this method achieves, on average, an 11% higher success rate than a language-grounded baseline and a 24% higher success rate than a standard diffusion policy baseline. Furthermore, we find that leveraging mid-level representations as supervision signals for policy actions within a weighted imitation learning algorithm improves the precision with which the policy follows these representations, leading to an additional performance increase of 10%. Our findings highlight the importance of grounding robot policies with not only broad perceptual tasks but also more granular, actionable representations. For further information and videos, please visit https://mid-level-moe.github.io.
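To make the idea of combining several specialist encoders more concrete, the following is a minimal sketch of one common way to blend expert outputs: a softmax gate that produces a weighted sum of per-expert feature vectors. This excerpt does not specify the paper's actual gating mechanism, so the softmax gate, the function names `softmax` and `combine_experts`, and the toy feature vectors are all assumptions made for illustration, not the authors' architecture.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of gating scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def combine_experts(expert_features: Sequence[Sequence[float]],
                    gate_scores: Sequence[float]) -> List[float]:
    """Blend per-expert feature vectors into a single policy input.

    expert_features: one equal-length vector per expert. These are
        hypothetical stand-ins for the outputs of encoders trained on
        different mid-level representations.
    gate_scores: one unnormalized score per expert (a hypothetical
        gating output; the paper's gating may differ).
    """
    if len(expert_features) != len(gate_scores):
        raise ValueError("need exactly one gate score per expert")
    dim = len(expert_features[0])
    if any(len(f) != dim for f in expert_features):
        raise ValueError("all expert feature vectors must share one length")
    weights = softmax(gate_scores)
    # Weighted sum across experts, computed one feature dimension at a time.
    return [sum(w * f[i] for w, f in zip(weights, expert_features))
            for i in range(dim)]


# Toy example: three experts with 2-D features and equal gate scores,
# so each expert contributes with weight 1/3.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = combine_experts(features, gate_scores=[0.0, 0.0, 0.0])
```

In a sketch like this, the fused vector would then be passed to the downstream policy as conditioning. A learned gate would produce `gate_scores` from the observation, which lets the policy lean on whichever representation is most informative for the current task.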
Large pre-trained robotics models have made significant progress in recent years towards improving robotic generalization capabilities by leveraging large-scale pre-training datasets. However, these models still face challenges in adapting to slight scene variations such as different spatial locations, unseen objects, and different lighting conditions.
Jun-9-2025