Language-Conditioned Open-Vocabulary Mobile Manipulation with Pretrained Models

Shen Tan, Dong Zhou, Xiangyu Shao, Junqiao Wang, Guanghui Sun

arXiv.org Artificial Intelligence 

Open-vocabulary mobile manipulation (OVMM), which involves handling novel and unseen objects across different workspaces, remains a significant challenge for real-world robotic applications. In this paper, we propose a novel Language-conditioned Open-Vocabulary Mobile Manipulation framework, named LOVMM, which incorporates a large language model (LLM) and a vision-language model (VLM) to tackle various mobile manipulation tasks in household environments. LOVMM follows free-form natural language instructions (e.g., "toss the food boxes on the office room desk to the trash bin in the corner" and "pack the bottles from the bed to the box in the guestroom"). Extensive experiments in simulated complex household environments show the strong zero-shot generalization and multi-task learning abilities of LOVMM. Moreover, our approach also generalizes to multiple tabletop manipulation tasks and achieves higher success rates than other state-of-the-art methods.

1 Introduction

As one of the key capabilities for robotic home assistance, open-vocabulary mobile manipulation (OVMM), which leverages vision cameras to navigate the environment and execute human-like actions to manipulate unseen objects, has attracted wide attention. It is crucial for addressing real-world challenges such as object sorting and rearrangement [Zeng et al., 2022], [Gan et al., 2022], household cleanup [Yan et al., 2021], [Wu et al., 2023], and human assistance [Yenamandra et al., 2023], [Stone et al., 2023]. Traditionally, robotic manipulation relies on vision-based methods that use explicit, object-centric representations, including poses, categories, and instance segmentations for perception [Pan et al., 2023], [Geng et al., 2023a], [Xie et al., 2020]. Recently, end-to-end models that learn from expert demonstrations have emerged as promising alternatives [Zeng et al., 2021], [Seita et al., 2021], [Geng et al., 2023b]. By leveraging visual observations without any explicit object information, these models can extract more generalizable representations across tasks and adapt zero-shot to unseen scenarios. Yet, such methods are limited by the insufficient information provided by single-modal data, or they may require goal images as instructions to adapt to new situations.
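To make the idea of language conditioning concrete, the following is a minimal, illustrative sketch (not the LOVMM pipeline, whose details are not given in this excerpt): a small PyTorch policy that fuses a precomputed instruction embedding with a precomputed image embedding and predicts an end-effector action. The encoder choice, feature dimensions, and 7-DoF action parameterization are assumptions made for illustration only.

```python
# Illustrative sketch only; NOT the LOVMM architecture. It shows how a policy
# can be conditioned on a language instruction by fusing text and image
# features before predicting an action. Dimensions are assumed values.
import torch
import torch.nn as nn


class LanguageConditionedPolicy(nn.Module):
    def __init__(self, text_dim=512, img_dim=512, hidden=256, action_dim=7):
        super().__init__()
        # Text/image features are assumed to come from pretrained encoders
        # (e.g., a VLM); here they are treated as precomputed vectors.
        self.fuse = nn.Sequential(
            nn.Linear(text_dim + img_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
        )
        # Hypothetical action head: 7-DoF end-effector target
        # (x, y, z position plus quaternion orientation).
        self.action_head = nn.Linear(hidden, action_dim)

    def forward(self, text_feat, img_feat):
        fused = self.fuse(torch.cat([text_feat, img_feat], dim=-1))
        return self.action_head(fused)


if __name__ == "__main__":
    policy = LanguageConditionedPolicy()
    # Dummy features standing in for encodings of an instruction such as
    # "toss the food boxes on the office room desk to the trash bin"
    # and the robot's current camera observation.
    text_feat = torch.randn(1, 512)
    img_feat = torch.randn(1, 512)
    action = policy(text_feat, img_feat)
    print(action.shape)  # torch.Size([1, 7])
```

In a full system, the instruction and observation features would be produced by pretrained language and vision-language models rather than random tensors; the sketch only illustrates the conditioning mechanism itself.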