vlmap
Multimodal Spatial Language Maps for Robot Navigation and Manipulation
Huang, Chenguang, Mees, Oier, Zeng, Andy, Burgard, Wolfram
Grounding language to a navigating agent's observations can leverage pretrained multimodal foundation models to match perceptions to object or event descriptions. However, previous approaches remain disconnected from environment mapping, lack the spatial precision of geometric maps, or neglect additional modality information beyond vision. To address this, we propose multimodal spatial language maps as a spatial map representation that fuses pretrained multimodal features with a 3D reconstruction of the environment. We build these maps autonomously using standard exploration. We present two instances of our maps, which are visual-language maps (VLMaps) and their extension to audio-visual-language maps (AVLMaps) obtained by adding audio information. When combined with large language models (LLMs), VLMaps can (i) translate natural language commands into open-vocabulary spatial goals (e.g., "in between the sofa and TV") directly localized in the map, and (ii) be shared across different robot embodiments to generate tailored obstacle maps on demand. Building upon the capabilities above, AVLMaps extend VLMaps by introducing a unified 3D spatial representation integrating audio, visual, and language cues through the fusion of features from pretrained multimodal foundation models. This enables robots to ground multimodal goal queries (e.g., text, images, or audio snippets) to spatial locations for navigation. Additionally, the incorporation of diverse sensory inputs significantly enhances goal disambiguation in ambiguous environments. Experiments in simulation and real-world settings demonstrate that our multimodal spatial language maps enable zero-shot spatial and multimodal goal navigation and improve recall by 50% in ambiguous scenarios. These capabilities extend to mobile robots and tabletop manipulators, supporting navigation and interaction guided by visual, audio, and spatial cues.
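The goal disambiguation described above can be pictured with a small sketch. It assumes each modality yields a heatmap of scores over the same grid of map cells and fuses them by elementwise multiplication. The grid values, the function names `fuse_heatmaps` and `argmax_cell`, and the fusion rule itself are illustrative assumptions, not the authors' implementation.

```python
# Toy sketch: disambiguate a goal by fusing per-modality heatmaps.
# Assumption: every modality gives a score in [0, 1] for each map cell.

def fuse_heatmaps(*heatmaps):
    """Multiply heatmaps cell by cell; all must share the same shape."""
    rows, cols = len(heatmaps[0]), len(heatmaps[0][0])
    fused = [[1.0] * cols for _ in range(rows)]
    for hm in heatmaps:
        if len(hm) != rows or any(len(r) != cols for r in hm):
            raise ValueError("heatmaps must share one shape")
        for i in range(rows):
            for j in range(cols):
                fused[i][j] *= hm[i][j]
    return fused


def argmax_cell(heatmap):
    """Return (row, col) of the highest-scoring cell (first one on ties)."""
    cells = ((i, j) for i in range(len(heatmap)) for j in range(len(heatmap[0])))
    return max(cells, key=lambda ij: heatmap[ij[0]][ij[1]])


# Hypothetical scores: two sofas match the text query "sofa" equally well.
text_heat = [[0.9, 0.1, 0.0],
             [0.1, 0.0, 0.1],
             [0.0, 0.1, 0.9]]
# Hypothetical scores: a sound was localized near the bottom-right corner.
audio_heat = [[0.1, 0.1, 0.1],
              [0.1, 0.3, 0.6],
              [0.1, 0.6, 0.8]]

goal = argmax_cell(fuse_heatmaps(text_heat, audio_heat))
print(goal)  # the sofa that is also close to the sound
```

With text alone the two sofas tie; multiplying in the audio heatmap leaves only one cell with a high score in both modalities.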
Context-Aware Replanning with Pre-explored Semantic Map for Object Navigation
Su, Hung-Ting, Chen, Ching-Yuan, Ko, Po-Chen, Yeh, Jia-Fong, Sun, Min, Hsu, Winston H.
Object navigation, which involves localizing and navigating to an object in indoor environments, is a critical component of robotic applications. Conventional training-based methods necessitate extensive annotations, meticulous model design, and prolonged training periods to effectively align the control actions with visual perception. Advancements in visual language models (VLMs) have led to the development of modular approaches that separate perception from actions, utilizing pre-trained knowledge. Under this framework, visual perception can be independently learned without direct control, making exploration prior to task execution an effective strategy. Pre-explored Semantic Maps [1, 2, 3, 4, 5], constructed through prior exploration with VLMs, have become a fundamental backbone for robotics tasks.
Do Visual-Language Maps Capture Latent Semantics?
Pekkanen, Matti, Mihaylova, Tsvetomila, Verdoja, Francesco, Kyrki, Ville
Visual-language models (VLMs) have recently been introduced in robotic mapping by using the latent representations, i.e., embeddings, of the VLMs to represent the natural language semantics in the map. The main benefit is moving beyond a small set of human-created labels toward open-vocabulary scene understanding. While there is anecdotal evidence that maps built this way support downstream tasks, such as navigation, rigorous analysis of the quality of the maps using these embeddings is lacking. We investigate two critical properties of map quality: queryability and consistency. The evaluation of queryability addresses the ability to retrieve information from the embeddings. We investigate two aspects of consistency: intra-map consistency and inter-map consistency. Intra-map consistency captures the ability of the embeddings to represent abstract semantic classes, and inter-map consistency captures the generalization properties of the representation. In this paper, we propose a way to analyze the quality of maps created using VLMs, which forms an open-source benchmark to be used when proposing new open-vocabulary map representations. We demonstrate the benchmark by evaluating the maps created by two state-of-the-art methods, VLMaps and OpenScene, with two encoders, LSeg and OpenSeg, on real-world data from the Matterport3D dataset. We find that OpenScene outperforms VLMaps with both encoders, and LSeg outperforms OpenSeg with both methods.
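One way to picture a queryability check is label retrieval: for each map cell, pick the label whose embedding is most similar and compare it with the ground truth. The sketch below uses hand-made 2D vectors in place of real VLM embeddings. The function names and the top-1 accuracy metric are assumptions made for illustration and are not taken from the benchmark itself.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top1_accuracy(cell_embeddings, cell_labels, label_embeddings):
    """Fraction of cells whose most similar label embedding is the true label."""
    correct = 0
    for emb, truth in zip(cell_embeddings, cell_labels):
        predicted = max(
            label_embeddings,
            key=lambda name: cosine(emb, label_embeddings[name]),
        )
        correct += predicted == truth
    return correct / len(cell_labels)


# Hypothetical label and cell embeddings (stand-ins for VLM features).
label_embeddings = {"chair": [1.0, 0.0], "table": [0.0, 1.0]}
cell_embeddings = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.7, 0.3]]
cell_labels = ["chair", "table", "table", "chair"]

accuracy = top1_accuracy(cell_embeddings, cell_labels, label_embeddings)
print(accuracy)
```

The third cell is labeled "table" but its embedding lies closer to "chair", so it counts as a retrieval error.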
Instance-Level Semantic Maps for Vision Language Navigation
Nanwani, Laksh, Agarwal, Anmol, Jain, Kanishk, Prabhakar, Raghav, Monis, Aaron, Mathur, Aditya, Murthy, Krishna, Hafez, Abdul, Gandhi, Vineet, Krishna, K. Madhava
Humans have a natural ability to perform semantic associations with the surrounding objects in the environment. This lets them build a mental map of the environment and navigate on demand when given linguistic instructions. A natural goal in Vision Language Navigation (VLN) research is to endow autonomous agents with similar capabilities. Recent works take a step towards this goal by creating a semantic spatial map representation of the environment without any labeled data. However, these representations are of limited practical use because they do not distinguish between different instances of the same object. In this work, we address this limitation by integrating instance-level information into the spatial map representation using a community detection algorithm and utilizing word ontology learned by large language models (LLMs) to perform open-set semantic associations in the mapping representation. The resulting map representation improves navigation performance two-fold (233%) on realistic language commands with instance-specific descriptions compared to the baseline. We validate the practicality and effectiveness of our approach through extensive qualitative and quantitative experiments.
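The instance problem named above can be illustrated on a toy label grid. The sketch below separates cells that share a semantic label into distinct instances using plain 4-connected component labeling. This is a deliberately simpler stand-in for the community detection algorithm the abstract mentions, and the grid and the function name `split_instances` are hypothetical.

```python
from collections import deque


def split_instances(label_grid, target):
    """Group 4-connected cells labeled `target` into separate instances."""
    rows, cols = len(label_grid), len(label_grid[0])
    seen = set()
    instances = []
    for r in range(rows):
        for c in range(cols):
            if label_grid[r][c] != target or (r, c) in seen:
                continue
            # Breadth-first flood fill from an unvisited target cell.
            queue, cells = deque([(r, c)]), []
            seen.add((r, c))
            while queue:
                y, x = queue.popleft()
                cells.append((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and label_grid[ny][nx] == target
                            and (ny, nx) not in seen):
                        seen.add((ny, nx))
                        queue.append((ny, nx))
            instances.append(sorted(cells))
    return instances


# Hypothetical top-down semantic map with two separate chairs.
grid = [
    ["chair", "chair", "floor", "chair"],
    ["floor", "floor", "floor", "chair"],
    ["table", "table", "floor", "floor"],
]
chairs = split_instances(grid, "chair")
print(len(chairs), chairs)
```

A purely semantic map would report one "chair" region; splitting by connectivity gives two instances that a command such as "go to the second chair" could then refer to.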
Visual language maps for robot navigation – Google AI Blog
People are excellent navigators of the physical world, due in part to their remarkable ability to build cognitive maps that form the basis of spatial memory -- from localizing landmarks at varying ontological levels (like a book on a shelf in the living room) to determining whether a layout permits navigation from point A to point B. Building robots that are proficient at navigation requires an interconnected understanding of (a) vision and natural language (to associate landmarks or follow instructions), and (b) spatial reasoning (to connect a map representing an environment to the true spatial distribution of objects). While there have been many recent advances in training joint visual-language models on Internet-scale data, figuring out how to best connect them to a spatial representation of the physical world that can be used by robots remains an open research question. To explore this, we collaborated with researchers at the University of Freiburg and Nuremberg to develop Visual Language Maps (VLMaps), a map representation that directly fuses pre-trained visual-language embeddings into a 3D reconstruction of the environment. VLMaps, which is set to appear at ICRA 2023, is a simple approach that allows robots to (1) index visual landmarks in the map using natural language descriptions, (2) employ Code as Policies to navigate to spatial goals, such as "go in between the sofa and TV" or "move three meters to the right of the chair", and (3) generate open-vocabulary obstacle maps -- allowing multiple robots with different morphologies (mobile manipulators vs. drones, for example) to use the same VLMap for path planning. VLMaps can be used out-of-the-box without additional labeled data or model fine-tuning, and outperforms other zero-shot methods by over 17% on challenging object-goal and spatial-goal navigation tasks in Habitat and Matterport3D.
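Spatial goals such as "in between the sofa and TV" or "three meters to the right of the chair" reduce to simple geometry once landmarks are localized in the map. The following sketch shows that geometry under stated assumptions: landmark positions are hard-coded 2D points in meters rather than looked up in a real map, "right" is taken relative to the robot's current heading, and the helper names `between` and `right_of` are hypothetical rather than part of VLMaps or Code as Policies.

```python
import math


def between(a, b):
    """Midpoint of two 2D points (x, y) in meters."""
    return ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2)


def right_of(anchor, heading_rad, distance_m):
    """Point `distance_m` to the right of `anchor`.

    `heading_rad` is the robot heading, counter-clockwise from the +x axis.
    The right-hand direction is the heading rotated by -90 degrees.
    """
    right_angle = heading_rad - math.pi / 2
    return (
        anchor[0] + distance_m * math.cos(right_angle),
        anchor[1] + distance_m * math.sin(right_angle),
    )


# Hypothetical landmark positions, as if already localized in the map.
sofa, tv, chair = (0.0, 0.0), (4.0, 2.0), (1.0, 2.0)

goal_between = between(sofa, tv)
# Robot faces +y, so its right-hand side points along +x.
goal_right = right_of(chair, math.pi / 2, 3.0)
print(goal_between, goal_right)
```

In a language-driven pipeline, calls like these would be the kind of primitives an LLM-generated program composes; here they are invoked by hand.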
Visual Language Maps for Robot Navigation
Huang, Chenguang, Mees, Oier, Zeng, Andy, Burgard, Wolfram
Grounding language to the visual observations of a navigating agent can be performed using off-the-shelf visual-language models pretrained on Internet-scale data (e.g., image captions). While this is useful for matching images to natural language descriptions of object goals, it remains disjoint from the process of mapping the environment, so that it lacks the spatial precision of classic geometric maps. To address this problem, we propose VLMaps, a spatial map representation that directly fuses pretrained visual-language features with a 3D reconstruction of the physical world. VLMaps can be autonomously built from video feed on robots using standard exploration approaches and enables natural language indexing of the map without additional labeled data. Specifically, when combined with large language models (LLMs), VLMaps can (i) translate natural language commands into a sequence of open-vocabulary navigation goals (which, beyond prior work, can be spatial by construction, e.g., "in between the sofa and TV" or "three meters to the right of the chair") directly localized in the map, and (ii) be shared among multiple robots with different embodiments to generate new obstacle maps on the fly (by using a list of obstacle categories). Extensive experiments carried out in simulated and real-world environments show that VLMaps enable navigation according to more complex language instructions than existing methods. Videos are available at https://vlmaps.github.io.
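Natural language indexing and per-embodiment obstacle maps can be sketched on a tiny grid. The example below assumes each map cell stores a feature vector, labels every cell with the category whose text embedding has the highest cosine similarity, and then marks a cell as an obstacle if its label is in a robot-specific category list. The 3D toy vectors stand in for real visual-language features, and the function names, categories, and obstacle lists are illustrative assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def label_cells(feature_map, text_embeddings):
    """Assign each cell the category whose text embedding is most similar."""
    return [
        [max(text_embeddings, key=lambda n: cosine(cell, text_embeddings[n]))
         for cell in row]
        for row in feature_map
    ]


def obstacle_map(label_grid, obstacle_categories):
    """True where the cell's label is an obstacle for this embodiment."""
    return [[label in obstacle_categories for label in row] for row in label_grid]


# Hypothetical text embeddings and per-cell map features.
text_embeddings = {
    "floor": [1.0, 0.0, 0.0],
    "table": [0.0, 1.0, 0.0],
    "wall": [0.0, 0.0, 1.0],
}
feature_map = [
    [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.9]],
    [[0.8, 0.2, 0.1], [0.7, 0.1, 0.2], [0.1, 0.1, 0.9]],
]

labels = label_cells(feature_map, text_embeddings)
# Assumed obstacle lists: a ground robot avoids tables, a drone can fly over them.
ground_obstacles = obstacle_map(labels, {"table", "wall"})
drone_obstacles = obstacle_map(labels, {"wall"})
print(labels)
print(ground_obstacles)
print(drone_obstacles)
```

The same labeled map produces different obstacle maps simply by changing the category list, which is the sense in which one map can serve several embodiments.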