Rethinking Intermediate Representation for VLM-based Robot Manipulation

Tang, Weiliang, Gao, Jialin, Pan, Jia-Hui, Wang, Gang, Li, Li Erran, Liu, Yunhui, Ding, Mingyu, Heng, Pheng-Ann, Fu, Chi-Wing

Nov-25-2025–arXiv.org Artificial Intelligence

Vision-Language Models (VLMs) are an important component for enabling robust robot manipulation. Yet using them to translate human instructions into an action-resolvable intermediate representation often requires a tradeoff between VLM-comprehensibility and generalizability. Inspired by context-free grammar, we design a Semantic Assembly representation named SEAM by decomposing the intermediate representation into vocabulary and grammar. Doing so leads us to a concise vocabulary of semantically rich operations and a VLM-friendly grammar for handling diverse unseen tasks. In addition, we design a new open-vocabulary segmentation paradigm with a retrieval-augmented few-shot learning strategy to localize fine-grained object parts for manipulation, achieving the shortest inference time among all state-of-the-art parallel works. We also formulate new metrics for action-generalizability and VLM-comprehensibility, demonstrating the compelling performance of SEAM over mainstream representations on both aspects.

artificial intelligence, large language model, natural language

Genre:
- Research Report (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (0.68)
  - Robots > Manipulation (0.61)
