SEM: Enhancing Spatial Understanding for Robust Robot Manipulation
Lin, Xuewu, Lin, Tianwei, Huang, Lichao, Xie, Hongyu, Jin, Yiwei, Li, Keyu, Su, Zhizhong
arXiv.org Artificial Intelligence
Abstract: A key challenge in robot manipulation lies in developing policy models with consistent spatial understanding, i.e., the ability to reason about 3D geometry, object relations, and robot state. Existing mainstream models take 2D images as input without performing explicit 3D modeling, and thus lack spatial understanding as well as 3D and embodiment generalization. To address this, we propose SEM (Spatial Enhanced Manipulation), a diffusion-based policy framework that constructs a unified spatial representation by projecting multi-view image features and joint-centric robot states into a shared 3D space. This spatially aligned representation provides a consistent feature space for the diffusion policy to condition on during action generation. Extensive experiments demonstrate that SEM significantly improves spatial understanding, yielding robust and generalizable manipulation across diverse tasks and outperforming existing baselines.
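As a rough illustration of what placing multi-view image features in a common 3D space can mean, the sketch below projects a 3D point into each camera with a pinhole model and averages the features found at the resulting pixels. This is a generic technique and not necessarily SEM's actual mechanism, which the abstract does not detail. The names `project_point` and `fuse_point_feature`, the dictionary layout of a view, nearest-pixel sampling, and plain averaging are all assumptions made for the example.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def project_point(point: Vec3, rotation: Sequence[Sequence[float]],
                  translation: Vec3, fx: float, fy: float,
                  cx: float, cy: float) -> Optional[Tuple[float, float]]:
    """Project a world-frame 3D point to pixel coordinates (pinhole model).

    Returns None when the point is not in front of the camera.
    """
    # World frame -> camera frame: p_cam = R @ p_world + t
    x, y, z = (
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )
    if z <= 0.0:
        return None
    # Perspective division followed by the intrinsic scaling and offset.
    return fx * x / z + cx, fy * y / z + cy


def fuse_point_feature(point: Vec3, views: List[Dict]) -> Optional[List[float]]:
    """Average nearest-pixel features over all views that see the point.

    Each view is a hypothetical dict with keys "R", "t", "intrinsics"
    (fx, fy, cx, cy) and "features", where features[row][col] is a vector.
    """
    total: Optional[List[float]] = None
    count = 0
    for view in views:
        uv = project_point(point, view["R"], view["t"], *view["intrinsics"])
        if uv is None:
            continue  # point is behind this camera
        col, row = int(round(uv[0])), int(round(uv[1]))
        fmap = view["features"]
        if not (0 <= row < len(fmap) and 0 <= col < len(fmap[0])):
            continue  # projection falls outside the feature map
        feat = fmap[row][col]
        total = list(feat) if total is None else [a + b for a, b in zip(total, feat)]
        count += 1
    if total is None:
        return None
    return [value / count for value in total]
```

Calling `fuse_point_feature` for every point of a 3D grid or point set would give per-point features expressed in one coordinate frame, independent of how many cameras are used. A learned model would typically replace nearest-pixel sampling and averaging with interpolation and a learned fusion step.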
Sep-25-2025