Real-Time 3D Vision-Language Embedding Mapping

arXiv.org Artificial Intelligence

A. Vision-Language Models in Robotics

In contrast to classic closed-set methods trained on specific labels, novel Vision-Language Models (VLMs) enable the open-set association of images with their text descriptions [6], [12], or other modalities [13], via a common embedding space, using individual transformers for image, text, or other modalities. VLMs have been used in robotics for open-set tracking of objects in the current camera FoV [14], for interactive pose estimation of relevant parts of tools [15], and for navigation via hand-drawn instructions [16]. By focusing on a single task and the current FoV, these approaches cannot generalise to other tasks or operate on a global level, such as localising tools outside the current FoV. In contrast, we integrate the open-set VLM embeddings in a task-agnostic 3D representation in order to enable a variety of interactive robotic use cases on the same vision-language representation.

B. Implicit Neural Representations

Due to the availability of vast amounts of 2D images and text, Vision Transformers (ViT) are predominantly trained on 2D image data [17].
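The open-set association through a common embedding space mentioned in section A can be illustrated with a minimal sketch. It assumes that image and text encoders have already produced vectors in the same space; the three-dimensional vectors, the label strings, and the helper names `cosine_similarity` and `best_matching_text` are all made up for illustration and do not come from the paper or from any particular VLM library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def best_matching_text(image_embedding, text_embeddings):
    """Return the text whose embedding is most similar to the image embedding."""
    return max(
        text_embeddings,
        key=lambda text: cosine_similarity(image_embedding, text_embeddings[text]),
    )


# Toy embeddings standing in for encoder outputs (hypothetical values).
text_embeddings = {
    "a hammer": [0.9, 0.1, 0.0],
    "a screwdriver": [0.1, 0.9, 0.1],
    "a coffee mug": [0.0, 0.2, 0.9],
}
image_embedding = [0.8, 0.3, 0.1]

match = best_matching_text(image_embedding, text_embeddings)
print(match)
```

Because the candidate texts are ordinary strings supplied at query time, the label set is not fixed at training time, which is what makes the association open-set.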


ConceptFusion: Open-set Multimodal 3D Mapping

arXiv.org Artificial Intelligence

Building 3D maps of the environment is central to robot navigation, planning, and interaction with objects in a scene. Most existing approaches that integrate semantic concepts with 3D maps largely remain confined to the closed-set setting: they can only reason about a finite set of concepts, pre-defined at training time. Further, these maps can only be queried using class labels, or in recent work, using text prompts. We address both these issues with ConceptFusion, a scene representation that is (i) fundamentally open-set, enabling reasoning beyond a closed set of concepts, and (ii) inherently multimodal, enabling a diverse range of possible queries to the 3D map, from language, to images, to audio, to 3D geometry, all working in concert. ConceptFusion leverages the open-set capabilities of today's foundation models pre-trained on internet-scale data to reason about concepts across modalities such as natural language, images, and audio. We demonstrate that pixel-aligned open-set features can be fused into 3D maps via traditional SLAM and multi-view fusion approaches. This enables effective zero-shot spatial reasoning, not needing any additional training or finetuning, and retains long-tailed concepts better than supervised approaches, outperforming them by a margin of more than 40% on 3D IoU. We extensively evaluate ConceptFusion on a number of real-world datasets, simulated home environments, a real-world tabletop manipulation task, and an autonomous driving platform. We showcase new avenues for blending foundation models with 3D open-set multimodal mapping. For more information, visit our project page https://concept-fusion.github.io or watch our 5-minute explainer video https://www.youtube.com/watch?v=rkXgws8fiDs
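The idea of fusing pixel-aligned features into a 3D map and then querying it can be sketched as follows. This is only an illustration under stated assumptions: it uses a weighted running average per map point as the fusion rule and cosine similarity as the query score, and the abstract does not confirm that ConceptFusion uses exactly this scheme. The `FeatureMap` class, its methods, the point ids, and the two-dimensional features are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit length; return a copy unchanged if it is all zeros."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)


class FeatureMap:
    """Hypothetical map storing one fused feature vector per 3D point id."""

    def __init__(self):
        self.features = {}  # point id -> fused feature vector
        self.weights = {}   # point id -> accumulated observation weight

    def fuse(self, point_id, feature, weight=1.0):
        """Merge a new per-pixel feature into a point via a weighted running average."""
        if point_id not in self.features:
            self.features[point_id] = list(feature)
            self.weights[point_id] = weight
            return
        w_old = self.weights[point_id]
        w_new = w_old + weight
        old = self.features[point_id]
        self.features[point_id] = [
            (w_old * o + weight * f) / w_new for o, f in zip(old, feature)
        ]
        self.weights[point_id] = w_new

    def query(self, query_embedding):
        """Return the cosine similarity of every map point to the query embedding."""
        q = normalize(query_embedding)
        return {
            pid: sum(a * b for a, b in zip(q, normalize(feat)))
            for pid, feat in self.features.items()
        }


fmap = FeatureMap()
fmap.fuse("p0", [1.0, 0.0])   # first view of point p0
fmap.fuse("p0", [0.8, 0.2])   # second view of the same point
fmap.fuse("p1", [0.0, 1.0])   # a different point

scores = fmap.query([1.0, 0.0])
best = max(scores, key=scores.get)
print(best, scores)
```

The query vector could come from any encoder that shares the embedding space, so the same map can in principle serve text, image, or audio queries without retraining.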