FineCLIP: Self-distilled Region-based CLIP for Better Fine-grained Understanding

Dong Jing

Neural Information Processing Systems 

Contrastive Language-Image Pre-training (CLIP) achieves impressive performance on tasks like image classification and image-text retrieval by learning on large-scale image-text datasets. However, CLIP struggles with dense prediction tasks due to its poor grasp of fine-grained details. Although existing works address this issue, they achieve only limited improvements and usually sacrifice the important visual-semantic consistency. To overcome these limitations, we propose FineCLIP, which keeps the global contrastive learning to preserve the visual-semantic consistency and further enhances fine-grained understanding through two innovations: 1) A real-time self-distillation scheme that facilitates the transfer of representation capability from global to local features. 2) A semantically-rich regional contrastive learning paradigm with generated region-text pairs, boosting the local representation capability with abundant fine-grained knowledge.
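To make the two training signals named above concrete, the sketch below (not the authors' code) pairs a standard symmetric InfoNCE loss for global image-text contrastive learning with a simple self-distillation term that pulls local (patch/region) features toward the model's own detached global embedding. All tensor names, the cosine-based distillation objective, and the loss weighting are illustrative assumptions; the abstract does not specify these details.

```python
import torch
import torch.nn.functional as F

def global_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired image/text embeddings (B, D)."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature        # (B, B) similarities
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def self_distillation_loss(local_feats, global_emb):
    """Pull each local feature (B, N, D) toward the detached global
    embedding (B, D), so the teacher is the model's own global view."""
    local_feats = F.normalize(local_feats, dim=-1)
    teacher = F.normalize(global_emb, dim=-1).detach()  # stop gradient
    cos = (local_feats * teacher.unsqueeze(1)).sum(-1)  # (B, N) cosines
    return (1.0 - cos).mean()

# Hypothetical combined training objective (lambda_distill is assumed):
# loss = global_contrastive_loss(img_emb, txt_emb) \
#        + lambda_distill * self_distillation_loss(patch_feats, img_emb)
```

Detaching the global embedding is one plausible way to realize "transfer from global to local features": the global branch, already supervised by image-text contrast, acts as a real-time teacher while gradients flow only into the local features.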