Rethinking Intermediate Representation for VLM-based Robot Manipulation
Tang, Weiliang, Gao, Jialin, Pan, Jia-Hui, Wang, Gang, Li, Li Erran, Liu, Yunhui, Ding, Mingyu, Heng, Pheng-Ann, Fu, Chi-Wing
–arXiv.org Artificial Intelligence
Vision-Language Model (VLM) is an important component to enable robust robot manipulation. Y et, using it to translate human instructions into an action-resolvable intermediate representation often needs a tradeoff between VLM-comprehensibility and generalizability. Inspired by context-free grammar, we design the Semantic Assembly representation named SEAM, by decomposing the intermediate representation into vocabulary and grammar . Doing so leads us to a concise vocabulary of semantically-rich operations and a VLM-friendly grammar for handling diverse unseen tasks. In addition, we design a new open-vocabulary segmentation paradigm with a retrieval-augmented few-shot learning strategy to localize fine-grained object parts for manipulation, effectively with the shortest inference time over all state-of-the-art parallel works. Also, we formulate new metrics for action-generalizability and VLM-comprehensibility, demonstrating the compelling performance of SEAM over mainstream representations on both aspects.
arXiv.org Artificial Intelligence
Nov-25-2025
- Country:
- Asia > Myanmar
- Tanintharyi Region > Dawei (0.04)
- North America > Mexico
- Gulf of Mexico (1.00)
- Asia > Myanmar
- Genre:
- Research Report (1.00)
- Technology: