Bridging Perception and Action: Spatially-Grounded Mid-Level Representations for Robot Generalization

Yang, Jonathan, Fu, Chuyuan Kelly, Shah, Dhruv, Sadigh, Dorsa, Xia, Fei, Zhang, Tingnan

Jun-9-2025–arXiv.org Artificial Intelligence

Figure 1: Bimanual, dexterous manipulation requires task-specific grounding. The left depicts various axes for spatial grounding as well as qualitative categorizations of different mid-level representations. Different representations lead to different levels of improvement depending on the task.

Abstract: In this work, we investigate how spatially-grounded auxiliary representations can provide both broad, high-level grounding and direct, actionable information to improve policy learning performance and generalization for dexterous tasks. We study these mid-level representations across three critical dimensions: object-centricity, pose-awareness, and depth-awareness. We use these interpretable mid-level representations to train specialist encoders via supervised learning, then use these representations as inputs to a diffusion policy to solve dexterous bimanual manipulation tasks in the real world. We propose a novel mixture-of-experts policy architecture that combines multiple specialized expert models, each trained on a distinct mid-level representation, to improve the generalization of the policy. On our evaluation tasks, this method achieves, on average, an 11% higher success rate than a language-grounded baseline and a 24% higher success rate than a standard diffusion policy baseline. Furthermore, we find that leveraging mid-level representations as supervision signals for policy actions within a weighted imitation learning algorithm improves the precision with which the policy follows these representations, leading to an additional performance increase of 10%. Our findings highlight the importance of grounding robot policies not only with broad, perceptual tasks, but also with more granular, actionable representations. For further information and videos, please visit https://mid-level-moe.github.io.

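To make the mixture-of-experts idea in the abstract concrete, the sketch below shows one generic way to combine features from several specialist encoders: a softmax gate produces one weight per expert, and the fused feature is the weighted sum. This is an illustration only. The abstract does not specify the gating mechanism, the feature sizes, or how the fused output conditions the diffusion policy, so the function names (`softmax`, `fuse_experts`), the three-expert setup, and all numeric values here are assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)


def fuse_experts(expert_features, gate_logits):
    """Combine per-expert features with softmax gate weights.

    expert_features: array of shape (num_experts, feature_dim), one row per
        specialist encoder (hypothetical, e.g. object-, pose-, depth-aware).
    gate_logits: array of shape (num_experts,), unnormalized gate scores.
    Returns the fused feature (feature_dim,) and the gate weights.
    """
    weights = softmax(gate_logits)
    fused = weights @ expert_features  # weighted sum over experts
    return fused, weights


# Hypothetical example: three experts, each emitting an 8-dim feature.
rng = np.random.default_rng(0)
expert_features = rng.normal(size=(3, 8))
gate_logits = np.array([0.5, 2.0, -1.0])  # assumed gate scores

fused, weights = fuse_experts(expert_features, gate_logits)

# With equal gate scores the fusion reduces to a plain average.
uniform_fused, uniform_weights = fuse_experts(expert_features, np.zeros(3))
```

In a learned system the gate scores would typically come from a small network conditioned on the observation, and the fused feature would be passed to the downstream policy; both of those pieces are omitted here because the excerpt does not describe them.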
Large pre-trained robotics models have made significant progress in recent years towards improving robotic generalization capabilities by leveraging large-scale pre-training datasets. However, these models still face challenges in adapting to slight scene variations such as different spatial locations, unseen objects, and different lighting conditions.

artificial intelligence, machine learning, representation


Genre:
- Research Report > New Finding (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning (1.00)
  - Robots > Manipulation (0.48)
