Globality Strikes Back: Rethinking the Global Knowledge of CLIP in Training-Free Open-Vocabulary Semantic Segmentation

Wang, Jingyun, Yan, Cilin, Kang, Guoliang

Feb-4-2025–arXiv.org Artificial Intelligence

Recent works modify CLIP to perform open-vocabulary semantic segmentation in a training-free manner (TF-OVSS). In CLIP, patch-wise image representations mainly encode the homogeneous image-level properties and thus are not discriminative enough, hindering its application to the dense prediction task. Previous works make image features more distinct across patches, through making each patch mainly attend to itself or the neighboring patches within a narrow local window. However, with their modifications, the ability of CLIP to aggregate global context information, which is known to be useful for distinguishing confusing categories, is largely weakened. In this paper, we propose a new method named GCLIP, which mines the beneficial global knowledge of CLIP to facilitate the TF-OVSS task. Firstly, we aim to equip the last-block attention with image-level properties while not introducing homogeneous attention patterns across patches. In GCLIP, we merge the attention from the global token emerging blocks with the Query-Query attention to realize this goal. Secondly, we aim to make the Value embeddings of the last-block attention module more distinct and semantically correlated. To realize this, we design a novel channel suppression strategy. As the representation of each patch is finally determined by the attention weights and the Value embeddings, our method can generate more discriminative patch-level image features while absorbing global context information. Extensive experiments on five standard benchmarks demonstrate that our method consistently outperforms previous state-of-the-arts.

artificial intelligence, machine learning, segmentation, (14 more...)

arXiv.org Artificial Intelligence

Feb-4-2025

arXiv.org PDF

Add feedback

Country:
- Europe > Poland (0.04)
- Asia > Middle East
  - Israel > Tel Aviv District > Tel Aviv (0.04)

Genre:
- Research Report (0.64)

Technology:
- Information Technology
  - Sensing and Signal Processing > Image Processing (1.00)
  - Artificial Intelligence
    - Vision (1.00)
    - Machine Learning (1.00)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found