CLOOB: Modern Hopfield Networks with InfoLOOB Outperform CLIP

Dec-24-2025, 15:01:45 GMT–Neural Information Processing Systems

CLIP yielded impressive results on zero-shot transfer learning tasks and is considered a foundation model like BERT or GPT3. CLIP vision models that have a rich representation are pre-trained using the InfoNCE objective and natural language supervision before they are fine-tuned on particular tasks. Though CLIP excels at zero-shot transfer learning, it suffers from an explaining away problem, that is, it focuses on one or a few features while neglecting other relevant features. This problem is caused by insufficiently extracting the covariance structure in the original multi-modal data. We suggest using modern Hopfield networks to tackle the problem of explaining away. Their retrieved embeddings have an enriched covariance structure derived from co-occurrences of features in the stored embeddings.
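The two ingredients named in the title and abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the standard one-step modern Hopfield update (a softmax-weighted average of stored patterns) and the usual single-anchor forms of InfoNCE and InfoLOOB, where InfoLOOB leaves the positive pair out of the denominator. The function names, the toy vectors, and the parameters `beta` and `tau` are hypothetical choices made for illustration.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def hopfield_retrieve(query, stored, beta=1.0):
    """One update step of a modern Hopfield network (assumed standard form).

    The retrieved vector is an average of the stored patterns, weighted by
    softmax(beta * <stored_i, query>). Because several stored patterns
    contribute, the result mixes in features that co-occur across them.
    """
    weights = softmax([beta * dot(query, s) for s in stored])
    dim = len(stored[0])
    return [sum(w * s[d] for w, s in zip(weights, stored)) for d in range(dim)]


def info_nce(anchor, positive, negatives, tau=0.1):
    """InfoNCE loss for one anchor: the positive appears in the denominator."""
    pos = math.exp(dot(anchor, positive) / tau)
    neg = sum(math.exp(dot(anchor, n) / tau) for n in negatives)
    return -math.log(pos / (pos + neg))


def info_loob(anchor, positive, negatives, tau=0.1):
    """InfoLOOB loss for one anchor: the positive is left out of the denominator."""
    pos = math.exp(dot(anchor, positive) / tau)
    neg = sum(math.exp(dot(anchor, n) / tau) for n in negatives)
    return -math.log(pos / neg)


# Toy data (hypothetical): two stored embeddings and a query close to the first.
stored = [[1.0, 0.0], [0.0, 1.0]]
query = [1.0, 0.0]

sharp = hopfield_retrieve(query, stored, beta=50.0)   # close to the first pattern
blended = hopfield_retrieve(query, stored, beta=0.0)  # uniform average of both

nce = info_nce(query, stored[0], [stored[1]])
loob = info_loob(query, stored[0], [stored[1]])
```

In this sketch a large `beta` makes retrieval behave like a nearest-pattern lookup, while a small `beta` blends the stored patterns. For a well-matched pair the InfoNCE value approaches zero, whereas the InfoLOOB value keeps decreasing as the positive similarity grows, because the positive term does not appear in its denominator.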

cloob, modern hopfield network, zero-shot transfer, (10 more...)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (0.88)
  - Machine Learning > Neural Networks (0.68)