Open-Vocabulary Panoptic Segmentation Using BERT Pre-Training of Vision-Language Multiway Transformer Model
Chen, Yi-Chia, Li, Wei-Hua, Chen, Chu-Song
–arXiv.org Artificial Intelligence
Open-vocabulary panoptic segmentation remains a challenging problem. One of the biggest difficulties lies in training models to generalize to an unlimited number of classes using limited categorized training data. Recent popular methods involve large-scale vision-language pre-trained foundation models, such as CLIP. In this paper, we propose OMTSeg for open-vocabulary segmentation using another large-scale vision-language pre-trained model called BEiT-3, leveraging the cross-modal attention between visual and linguistic features in BEiT-3 to achieve better performance. Experimental results demonstrate that OMTSeg performs favorably against state-of-the-art models.
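The abstract does not detail OMTSeg's attention mechanism, but the cross-modal attention it refers to can be illustrated with a generic single-head sketch in which visual patch features act as queries over text-token keys and values. This is an illustrative simplification, not the paper's actual BEiT-3 Multiway Transformer implementation; the shapes and feature dimension below are arbitrary assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(visual, text):
    """Visual tokens attend over text tokens (single head, no projections).

    visual: (N_v, d) array of visual patch features.
    text:   (N_t, d) array of text token features.
    Returns an (N_v, d) array of text-conditioned visual features.
    """
    d = visual.shape[-1]
    scores = visual @ text.T / np.sqrt(d)   # (N_v, N_t) similarity logits
    weights = softmax(scores, axis=-1)      # each visual token's distribution over text
    return weights @ text                   # aggregate text features per visual token

# Toy example: 14x14 = 196 patch tokens, 8 text tokens, 64-dim features (assumed sizes).
rng = np.random.default_rng(0)
vis = rng.standard_normal((196, 64))
txt = rng.standard_normal((8, 64))
out = cross_modal_attention(vis, txt)
print(out.shape)  # (196, 64)
```

In practice such a layer would also include learned query/key/value projections, multiple heads, and residual connections; the sketch keeps only the core attention computation between the two modalities.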
Dec-25-2024
- Genre:
- Research Report
- New Finding (0.34)
- Promising Solution (0.48)
- Technology:
- Information Technology > Artificial Intelligence
- Machine Learning (1.00)
- Natural Language (1.00)
- Vision (1.00)