CLIP in Mirror: Disentangling text from visual images through reflection

Neural Information Processing Systems 

The CLIP network excels at various tasks, but it struggles with text-visual images, i.e., images that contain both text and visual objects, because it risks confusing textual and visual representations. To address this issue, we propose MirrorCLIP, a zero-shot framework that disentangles the image features of CLIP by exploiting the difference in the mirror effect between visual objects and text: visual content is largely invariant under horizontal flipping, whereas rendered text is not. Specifically, MirrorCLIP takes both the original and the flipped image as inputs and compares their features dimension-wise in the latent space to generate disentangling masks. With these disentangling masks, we further design filters to separate textual and visual factors more precisely, and thus obtain disentangled representations.
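To make the mechanism concrete, the following is a minimal sketch of the dimension-wise comparison idea, assuming the OpenAI CLIP package and a hypothetical absolute-difference criterion with threshold tau; the backbone choice ("ViT-B/32"), the file name "sample.png", and the exact mask and filter definitions are illustrative assumptions, not the paper's precise formulation.

```python
# Sketch of mirror-based feature disentangling (assumptions noted above).
import torch
import clip
from PIL import Image
from torchvision.transforms.functional import hflip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("sample.png")).unsqueeze(0).to(device)
flipped = hflip(image)  # horizontally mirrored copy of the same image

with torch.no_grad():
    f_orig = model.encode_image(image)    # CLIP features of original image
    f_flip = model.encode_image(flipped)  # CLIP features of mirrored image

# Dimension-wise comparison: visual content is largely mirror-invariant,
# while text changes under reflection, so dimensions that stay stable are
# treated as visual and unstable ones as textual.
tau = 0.1  # hypothetical threshold on per-dimension change
delta = (f_orig - f_flip).abs()
visual_mask = (delta <= tau).float()  # stable dims -> visual factors
textual_mask = 1.0 - visual_mask      # unstable dims -> textual factors

f_visual = f_orig * visual_mask    # disentangled visual representation
f_textual = f_orig * textual_mask  # disentangled textual representation
```

In this sketch the masks act as hard per-dimension filters; the paper's filters refine this separation further, but the core signal, agreement versus disagreement between original and flipped features, is the same.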