Real-Time 3D Vision-Language Embedding Mapping
Rauch, Christian, Ellensohn, Björn, Nwankwo, Linus, Dave, Vedant, Rueckert, Elmar
–arXiv.org Artificial Intelligence
A. Vision-Language Models in Robotics

In contrast to classic closed-set methods trained on specific labels, novel Vision-Language Models (VLMs) enable the open-set association of images with their text descriptions [6], [12], or other modalities [13], via a common embedding space, using individual transformers for image, text, or other modalities. VLMs have been used in robotics for open-set tracking of objects in the current camera field of view (FoV) [14], for interactive pose estimation of relevant parts of tools [15], and for navigation via hand-drawn instructions [16]. By focusing on a single task and the current FoV, these approaches cannot generalise to other tasks or operate on a global level, such as localising tools outside the current FoV. In contrast, we integrate the open-set VLM embeddings into a task-agnostic 3D representation in order to enable a variety of interactive robotic use-cases on the same vision-language representation.

B. Implicit Neural Representations

Due to the availability of vast amounts of 2D images and text, Vision Transformers (ViT) are predominantly trained on 2D image data [17].
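The open-set association described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the embeddings are random stand-ins for the outputs of a CLIP-style image encoder and text encoder, and the label prompts are hypothetical. The key point is that in a shared embedding space, any free-text prompt can be compared against an image embedding by cosine similarity, with no fixed training vocabulary.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Scale vectors to unit length so dot products equal cosine similarity.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Hypothetical embeddings: one image vector and three open-set text prompts,
# assumed to come from a shared VLM embedding space (e.g. CLIP-style encoders).
rng = np.random.default_rng(0)
image_emb = l2_normalize(rng.normal(size=(1, 512)))
text_embs = l2_normalize(rng.normal(size=(3, 512)))
labels = ["a hammer", "a screwdriver", "a wrench"]

# Open-set association: cosine similarity between the image embedding and each
# text prompt; the label set is arbitrary text, not a closed training label set.
similarity = image_emb @ text_embs.T  # shape (1, 3)
best = labels[int(np.argmax(similarity))]
```

In a closed-set detector, `labels` would be fixed at training time; here new prompts can be added at query time, which is what the proposed 3D representation exploits when the same embeddings are attached to map geometry instead of a single image.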
Aug-11-2025