Bridging VLM and KMP: Enabling Fine-grained robotic manipulation via Semantic Keypoints Representation
Zhu, Junjie, Liu, Huayu, Wang, Jin, Wen, Bangrong, Huang, Kaixiang, Li, Xiaofei, Zhan, Haiyun, Lu, Guodong
arXiv.org Artificial Intelligence
From early Movement Primitive (MP) techniques to modern Vision-Language Models (VLMs), autonomous manipulation has remained a pivotal topic in robotics. The two approaches sit at opposite extremes: VLM-based methods emphasize zero-shot, adaptive manipulation but struggle with fine-grained planning, whereas MP-based approaches excel at precise trajectory generalization but lack decision-making ability. To leverage the strengths of both frameworks, we propose VL-MP, which integrates a VLM with Kernelized Movement Primitives (KMP) via a low-distortion bridge for transferring decision information, enabling fine-grained robotic manipulation in ambiguous situations. A key element of VL-MP is the accurate representation of task decision parameters through semantic keypoint constraints, which leads to more precise task parameter generation. Additionally, we introduce a KMP enhanced with local trajectory features to support VL-MP, thereby preserving the shape of complex trajectories. Extensive experiments in complex real-world environments validate the effectiveness of VL-MP for adaptive and fine-grained manipulation.
Mar-4-2025