Real-Time 3D Vision-Language Embedding Mapping
Rauch, Christian, Ellensohn, Björn, Nwankwo, Linus, Dave, Vedant, Rueckert, Elmar
–arXiv.org Artificial Intelligence
A. Vision-Language Models in Robotics

In contrast to classic closed-set methods trained on specific labels, novel Vision-Language Models (VLMs) enable the open-set association of images with their text descriptions [6], [12], or other modalities [13], via a common embedding space, using individual transformers for image, text, or other modalities. VLMs have been used in robotics for open-set tracking of objects in the current camera field of view (FoV) [14], for interactive pose estimation of relevant parts of tools [15], and for navigation via hand-drawn instructions [16]. By focusing on a single task and the current FoV, these approaches cannot generalise to other tasks or operate on a global level, such as localising tools outside the current FoV. In contrast, we integrate the open-set VLM embeddings into a task-agnostic 3D representation in order to enable a variety of interactive robotic use-cases on the same vision-language representation.

B. Implicit Neural Representations

Due to the availability of vast amounts of 2D images and text, Vision Transformers (ViT) are predominantly trained on 2D image data [17].
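The open-set association described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the embeddings are random stand-ins for the outputs of a CLIP-style image encoder and text encoder, and the label prompts are hypothetical. The key point is that in a shared embedding space, any free-text prompt can be compared against an image embedding by cosine similarity, with no fixed training vocabulary.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Scale vectors to unit length so dot products equal cosine similarity.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Hypothetical embeddings: one image vector and three open-set text prompts,
# assumed to come from a shared VLM embedding space (e.g. CLIP-style encoders).
rng = np.random.default_rng(0)
image_emb = l2_normalize(rng.normal(size=(1, 512)))
text_embs = l2_normalize(rng.normal(size=(3, 512)))
labels = ["a hammer", "a screwdriver", "a wrench"]

# Open-set association: cosine similarity between the image embedding and each
# text prompt; the label set is arbitrary text, not a closed training label set.
similarity = image_emb @ text_embs.T  # shape (1, 3)
best = labels[int(np.argmax(similarity))]
```

In a closed-set detector, `labels` would be fixed at training time; here new prompts can be added at query time, which is what the proposed 3D representation exploits when the same embeddings are attached to map geometry instead of a single image.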
Aug-11-2025