OTAS: Open-vocabulary Token Alignment for Outdoor Segmentation

Schwaiger, Simon, Thalhammer, Stefan, Wöber, Wilfried, Steinbauer-Wagner, Gerald

Sep-23-2025–arXiv.org Artificial Intelligence

Abstract: Understanding open-world semantics is critical for robotic planning and control, particularly in unstructured outdoor environments. Existing vision-language mapping approaches typically rely on object-centric segmentation priors, which often fail outdoors due to semantic ambiguities and indistinct class boundaries. We propose OTAS, an Open-vocabulary Token Alignment method for outdoor Segmentation. OTAS addresses the limitations of open-vocabulary segmentation models by extracting semantic structure directly from the output tokens of pre-trained vision models. By clustering semantically similar structures across single and multiple views and grounding them in language, OTAS reconstructs a geometrically consistent feature field that supports open-vocabulary segmentation queries. Our method operates in a zero-shot manner, without scene-specific fine-tuning, and achieves real-time performance of up to 17 fps. On the Off-Road Freespace Detection dataset, OTAS yields a modest IoU improvement over fine-tuned and open-vocabulary 2D segmentation baselines. In 3D segmentation on TartanAir, it achieves up to a 151% relative IoU improvement compared to existing open-vocabulary mapping methods. Real-world reconstructions further demonstrate the applicability of OTAS to robotic deployment.

Understanding the open world through semantics is a key challenge for robotics. Vision-Language Models (VLMs), which ground vision in language, have recently been shown to provide effective semantics for mapping, facilitating task planning and navigation [1], [2]. However, open-vocabulary semantic mapping methods [3], [4], [5] rely on segmentation priors from general-purpose models to reason about the environment.
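The abstract reports results as IoU and as a relative IoU improvement. As a reading aid, the following is a minimal sketch of how these two quantities are commonly computed for binary masks. It is not the authors' evaluation code: the function names `iou` and `relative_improvement` and the toy masks are made up for illustration, and treating two empty masks as a perfect match is an assumed convention.

```python
def iou(pred, target):
    """Intersection over union of two equally sized binary masks,
    given here as flat sequences of 0/1 values."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    # Assumed convention: two empty masks count as perfect agreement.
    return intersection / union if union else 1.0


def relative_improvement(new, baseline):
    """Relative change of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


# Toy masks, invented for illustration only (not data from the paper).
target = [1, 1, 1, 1, 0, 0, 0, 0]
baseline = [1, 0, 0, 0, 0, 0, 0, 1]
improved = [1, 1, 1, 0, 0, 0, 0, 1]

iou_base = iou(baseline, target)
iou_new = iou(improved, target)
print(f"baseline IoU: {iou_base:.2f}")
print(f"improved IoU: {iou_new:.2f}")
print(f"relative improvement: {relative_improvement(iou_new, iou_base):.1f}%")
```

Under this definition, a relative improvement of 151% means the new IoU is about 2.51 times the baseline IoU, not that the IoU rose by 151 percentage points.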


