OpenVox: Real-time Instance-level Open-vocabulary Probabilistic Voxel Representation

Deng, Yinan, Yao, Bicheng, Tang, Yihang, Yang, Yi, Yue, Yufeng

arXiv.org Artificial Intelligence 

In recent years, vision-language models (VLMs) have advanced open-vocabulary mapping, enabling mobile robots to simultaneously achieve environmental reconstruction and high-level semantic understanding. While integrated object cognition helps mitigate semantic ambiguity in point-wise feature maps, efficiently obtaining rich semantic understanding and robust incremental reconstruction at the instance level remains challenging. To address these challenges, we introduce OpenVox, a real-time incremental open-vocabulary probabilistic instance voxel representation. In the front-end, we design an efficient instance segmentation and comprehension pipeline that enhances language reasoning through encoding captions. In the back-end, we implement probabilistic instance voxels and formulate the cross-frame incremental fusion process as two subtasks: instance association and live map evolution, ensuring robustness to sensor and segmentation noise. Extensive evaluations across multiple datasets demonstrate that OpenVox achieves state-of-the-art performance in zero-shot instance segmentation, semantic segmentation, and open-vocabulary retrieval. The project page of OpenVox is available at https://open-vox.github.io/.

I. INTRODUCTION

Accurate 3D scene reconstruction and understanding are essential for robotic downstream tasks.