OTAS: Open-vocabulary Token Alignment for Outdoor Segmentation
Schwaiger, Simon; Thalhammer, Stefan; Wöber, Wilfried; Steinbauer-Wagner, Gerald
arXiv.org Artificial Intelligence
Abstract: Understanding open-world semantics is critical for robotic planning and control, particularly in unstructured outdoor environments. Existing vision-language mapping approaches typically rely on object-centric segmentation priors, which often fail outdoors due to semantic ambiguities and indistinct class boundaries. We propose OTAS, an Open-vocabulary Token Alignment method for Outdoor Segmentation. OTAS addresses the limitations of open-vocabulary segmentation models by extracting semantic structure directly from the output tokens of pre-trained vision models. By clustering semantically similar structures across single and multiple views and grounding them in language, OTAS reconstructs a geometrically consistent feature field that supports open-vocabulary segmentation queries. Our method operates in a zero-shot manner, without scene-specific fine-tuning, and achieves real-time performance of up to 17 fps. On the Off-Road Freespace Detection dataset, OTAS yields a modest IoU improvement over fine-tuned and open-vocabulary 2D segmentation baselines. In 3D segmentation on TartanAir, it achieves up to a 151% relative IoU improvement compared to existing open-vocabulary mapping methods. Real-world reconstructions further demonstrate OTAS's applicability to robotic deployment.
Understanding the open world through semantics is a key challenge for robotics. Vision-Language Models (VLMs), which ground vision in language, have recently been shown to effectively provide semantics for mapping to facilitate task planning and navigation [1], [2]. However, open-vocabulary semantic mapping methods [3], [4], [5] rely on segmentation priors from general-purpose models to reason about the environment.
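The abstract reports results as IoU and as a relative IoU improvement. As a reading aid, the sketch below shows how these two quantities are commonly computed for binary masks. It is not the authors' evaluation code: the function names `iou` and `relative_improvement` are hypothetical, the empty-mask convention is an assumption, and the numbers in the usage lines are illustrative and do not come from the paper.

```python
import numpy as np


def iou(pred, target):
    """Intersection over Union of two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        # Assumed convention: two empty masks count as perfect agreement.
        return 1.0
    intersection = np.logical_and(pred, target).sum()
    return float(intersection / union)


def relative_improvement(new_score, baseline_score):
    """Relative change of new_score over baseline_score (1.51 means 151%)."""
    if baseline_score == 0:
        raise ValueError("baseline_score must be non-zero")
    return (new_score - baseline_score) / baseline_score


# Illustrative masks and scores only; these are not values from the paper.
example_iou = iou([[1, 1], [0, 0]], [[1, 0], [0, 0]])
example_gain = relative_improvement(0.502, 0.2)
```

Note that a relative improvement of 151% means the new score is about 2.5 times the baseline, not 1.5 times, because the baseline itself is subtracted before dividing.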
Sep-23-2025
- Genre:
- Research Report (0.64)
- Technology:
- Information Technology > Artificial Intelligence
- Vision (1.00)
- Robots (1.00)
- Natural Language > Text Processing (0.68)
- Machine Learning > Statistical Learning
- Clustering (0.68)