Test-Time Distribution Normalization for Contrastively Learned Visual-language Models

Jan-19-2025, 15:57:09 GMT–Neural Information Processing Systems

Advances in visual-language contrastive learning have made it possible to carry out many downstream applications efficiently and accurately by simply taking the dot product between image and text representations. One of the most representative recent approaches, CLIP, has quickly gained widespread adoption due to its effectiveness. CLIP is trained with an InfoNCE loss that takes both positive and negative samples into account to help learn a much more robust representation space. This paper, however, reveals that the common downstream practice of taking a dot product is only a zeroth-order approximation of the optimization goal, resulting in a loss of information at test time. Intuitively, since the model has been optimized with the InfoNCE loss, test-time procedures should ideally be aligned with it as well.
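To make the contrast in the abstract concrete, the following is a minimal NumPy sketch of a symmetric InfoNCE objective alongside the plain dot-product scoring commonly used at test time. It is an illustration under stated assumptions, not the paper's implementation: the function names, the temperature value of 0.07, and the toy embeddings are all hypothetical, and the proposed normalization itself is not reproduced here.

```python
import numpy as np


def _l2_normalize(x):
    # Scale each row to unit length so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def _log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def info_nce_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings.

    Row i of image_emb and row i of text_emb form the positive pair;
    every other row in the batch acts as a negative.
    The temperature default is an assumed value for illustration.
    """
    logits = _l2_normalize(image_emb) @ _l2_normalize(text_emb).T / temperature
    idx = np.arange(logits.shape[0])
    loss_image_to_text = -_log_softmax(logits, axis=1)[idx, idx].mean()
    loss_text_to_image = -_log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_image_to_text + loss_text_to_image)


def dot_product_scores(image_emb, text_emb):
    """Test-time scoring: pairwise similarities only, with no negatives."""
    return _l2_normalize(image_emb) @ _l2_normalize(text_emb).T


# Toy data: four perfectly aligned pairs, then the same pairs misaligned.
emb = np.eye(4)
loss_matched = info_nce_loss(emb, emb)
loss_mismatched = info_nce_loss(emb, np.roll(emb, 1, axis=0))
scores = dot_product_scores(emb, emb)
```

Note that the training loss depends on the whole batch through the softmax denominator, while the test-time score of a pair depends only on that pair. This is the mismatch between training and test-time procedures that the abstract points to.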

contrastively learned visual-language model, distribution normalization, test-time distribution normalization

Technology:
- Information Technology
  - Visual Languages (0.64)
  - Artificial Intelligence
    - Machine Learning (0.41)
    - Natural Language (0.40)