Learning Mask-aware CLIP Representations for Zero-Shot Segmentation

Jan-19-2025, 05:30:30 GMT–Neural Information Processing Systems

Recently, pre-trained vision-language models have been increasingly used to tackle the challenging zero-shot segmentation task. Typical solutions follow the paradigm of first generating mask proposals and then adopting CLIP to classify them. To maintain CLIP's zero-shot transferability, previous practices favour freezing CLIP during training. However, in this paper, we reveal that CLIP is insensitive to different mask proposals and tends to produce similar predictions for different mask proposals of the same image. This issue mainly stems from the fact that CLIP is trained with image-level supervision.
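The classification step of the two-stage paradigm described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function name `classify_mask_proposals`, the `temperature` value, and the toy vectors are all assumptions, and the embeddings stand in for what a frozen CLIP image encoder and text encoder would produce. The sketch only shows that classification reduces to cosine similarity between proposal embeddings and class-name embeddings, so two proposals whose embeddings are nearly identical receive the same prediction.

```python
import numpy as np


def classify_mask_proposals(proposal_embeddings, text_embeddings, temperature=0.01):
    """Hypothetical helper: softmax over cosine similarities.

    proposal_embeddings: (num_proposals, dim) array, one row per mask proposal
        (assumed to come from an image encoder applied to each masked region).
    text_embeddings: (num_classes, dim) array, one row per class name.
    Returns a (num_proposals, num_classes) array of class probabilities.
    """
    # L2-normalise both sides so the dot product is a cosine similarity.
    img = proposal_embeddings / np.linalg.norm(proposal_embeddings, axis=1, keepdims=True)
    txt = text_embeddings / np.linalg.norm(text_embeddings, axis=1, keepdims=True)
    logits = img @ txt.T / temperature
    # Subtract the row-wise maximum for a numerically stable softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)


# Toy data (assumed, not real CLIP outputs): three class embeddings and two
# proposals from the same image whose embeddings barely differ.
toy_text = np.eye(3)
toy_proposals = np.array([[0.90, 0.10, 0.00],
                          [0.88, 0.12, 0.01]])
toy_probs = classify_mask_proposals(toy_proposals, toy_text)
toy_labels = toy_probs.argmax(axis=1)  # both proposals map to the same class
```

In this toy setting both proposals are assigned the same class even though they would cover different regions, which mirrors the insensitivity the abstract attributes to image-level supervision.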

learning mask-aware clip representation, mask proposal, zero-shot segmentation

Technology:
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.90)