Rewrite Caption Semantics: Bridging Semantic Gaps for Language-Supervised Semantic Segmentation

Jan-20-2025, 00:02:48 GMT–Neural Information Processing Systems

Vision-Language Pre-training has demonstrated its remarkable zero-shot recognition ability and potential to learn generalizable visual representations from language supervision. Taking a step ahead, language-supervised semantic segmentation enables spatial localization of textual inputs by learning pixel grouping solely from image-text pairs. Nevertheless, state-of-the-art methods suffer from a clear semantic gap between the visual and textual modalities: many visual concepts that appear in images are missing from their paired captions. Such semantic misalignment circulates in pre-training, leading to inferior zero-shot performance in dense predictions because the textual representations capture too few visual concepts. To close this semantic gap, we propose Concept Curation (CoCu), a pipeline that leverages CLIP to compensate for the missing semantics. For each image-text pair, we establish a concept archive that maintains potential visually-matched concepts with our proposed vision-driven expansion and text-to-vision-guided ranking.
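
The idea of ranking archived concepts against an image can be illustrated with a minimal sketch. This is not CoCu's actual procedure, which the abstract does not spell out. It assumes image and concept embeddings already exist (for example, from a CLIP-style encoder) and scores each candidate concept by cosine similarity to the image. The function names `cosine` and `rank_concepts`, the `top_k` parameter, and the toy three-dimensional vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_concepts(image_emb, concept_embs, top_k=2):
    """Return the top_k (concept, score) pairs, most image-similar first.

    Hypothetical helper: concept_embs maps a concept name to its
    text embedding; image_emb is the embedding of the paired image.
    """
    scored = [(name, cosine(image_emb, emb)) for name, emb in concept_embs.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy embeddings standing in for real encoder outputs (assumption).
image_emb = [1.0, 0.0, 0.0]
archive = {
    "dog": [0.9, 0.1, 0.0],
    "grass": [0.6, 0.8, 0.0],
    "car": [0.0, 0.0, 1.0],
}

top = rank_concepts(image_emb, archive, top_k=2)
for name, score in top:
    print(f"{name}: {score:.3f}")
```

In this toy setup, concepts whose embeddings point in a direction close to the image embedding rank highest, so a concept absent from the caption but visually present could still be surfaced and added to the text side of the pair.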

bridging semantic gap, language-supervised semantic segmentation, rewrite caption semantic

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning (0.92)
  - Natural Language > Large Language Model (0.56)