CLOOB: Modern Hopfield Networks with InfoLOOB Outperform CLIP

Neural Information Processing Systems

CLIP yielded impressive results on zero-shot transfer learning tasks and is considered a foundation model like BERT or GPT-3. CLIP vision models that have a rich representation are pre-trained using the InfoNCE objective and natural language supervision before they are fine-tuned on particular tasks. Though CLIP excels at zero-shot transfer learning, it suffers from an explaining away problem, that is, it focuses on one or a few features while neglecting other relevant features. This problem is caused by insufficiently extracting the covariance structure in the original multi-modal data. We suggest using modern Hopfield networks to tackle the problem of explaining away. Their retrieved embeddings have an enriched covariance structure derived from co-occurrences of features in the stored embeddings.



An experimental approach on Few Shot Class Incremental Learning

Adam, Marinela

arXiv.org Artificial Intelligence

Few-Shot Class-Incremental Learning (FSCIL) is a machine learning paradigm designed to let models assimilate new classes from limited examples while safeguarding existing knowledge. The paper presents different solutions, with extensive experiments across large-scale datasets, domain shifts, and network architectures to evaluate and compare the selected methods. We highlight their advantages and then present an experimental approach that aims to improve the most promising one by replacing the visual-language (V-L) model (CLIP) with another V-L model (CLOOB) that seems to outperform it on zero-shot learning tasks. The aim of this report is to present an experimental method for FSCIL that would improve its performance. We also plan to offer an overview and analysis of recent advancements in the FSCIL domain, focusing on various strategies to mitigate catastrophic forgetting and improve the adaptability of models to evolving tasks and datasets.


Topological Perspectives on Optimal Multimodal Embedding Spaces

B, Abdul Aziz A., Rahim, A. B Abdul

arXiv.org Artificial Intelligence

Recent strides in multimodal model development have ignited a paradigm shift in text-to-image generation. Among these advancements, CLIP stands out as a remarkable achievement: a sophisticated autoencoder adept at encoding both textual and visual information within a unified latent space. This paper presents a comparative analysis of CLIP and its recent counterpart, CLOOB. To unravel the distinctions within the embedding spaces crafted by these models, we employ topological data analysis. Our approach examines the drivers of the modality gap, the clustering structures existing across both high and low dimensions, and the pivotal role that dimension collapse plays in shaping their respective embedding spaces. Empirical experiments substantiate the implications of our analyses on downstream performance across various contextual scenarios. Through this investigation, we aim to shed light on the factors that underlie the comparative efficacy of CLIP and CLOOB, offering insights into their respective strengths and weaknesses and providing a foundation for further refinement and advancement in multimodal model research.


CLOOB: A New Contrastive Learning Method That Outperforms CLIP - AI Summary

#artificialintelligence

The paper "CLOOB: Modern Hopfield Networks with InfoLOOB Outperform CLIP" introduces a new self-supervised learning method in which modern Hopfield networks boost contrastive learning with the InfoLOOB (Leave One Out Bound) objective. CLOOB consistently outperforms CLIP at zero-shot transfer learning across different architectures and datasets.


CLOOB: Modern Hopfield Networks with InfoLOOB Outperform CLIP

Fürst, Andreas, Rumetshofer, Elisabeth, Lehner, Johannes, Tran, Viet, Tang, Fei, Ramsauer, Hubert, Kreil, David, Kopp, Michael, Klambauer, Günter, Bitto-Nemling, Angela, Hochreiter, Sepp

arXiv.org Artificial Intelligence

CLIP yielded impressive results on zero-shot transfer learning tasks and is considered a foundation model like BERT or GPT-3. CLIP vision models that have a rich representation are pre-trained using the InfoNCE objective and natural language supervision before they are fine-tuned on particular tasks. Though CLIP excels at zero-shot transfer learning, it suffers from an explaining away problem, that is, it focuses on one or a few features while neglecting other relevant features. This problem is caused by insufficiently extracting the covariance structure in the original multi-modal data. We suggest using modern Hopfield networks to tackle the problem of explaining away. Their retrieved embeddings have an enriched covariance structure derived from co-occurrences of features in the stored embeddings. However, modern Hopfield networks increase the saturation effect of the InfoNCE objective, which hampers learning. We propose to use the InfoLOOB objective to mitigate this saturation effect. We introduce the novel "Contrastive Leave One Out Boost" (CLOOB), which uses modern Hopfield networks for covariance enrichment together with the InfoLOOB objective. In experiments, we compare CLOOB to CLIP after pre-training on the Conceptual Captions and the YFCC dataset with respect to their zero-shot transfer learning performance on other datasets. CLOOB consistently outperforms CLIP at zero-shot transfer learning across all considered architectures and datasets.
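The saturation effect mentioned in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it treats a single anchor with made-up similarity values, whereas the paper's loss is averaged over a batch, applied in both directions, and computed on Hopfield-retrieved embeddings. The function names, the temperature `tau=0.1`, and the similarity values are assumptions chosen for illustration. The only difference between the two objectives here is whether the positive pair appears in the denominator.

```python
import math

def info_nce(sim_pos, sim_negs, tau=0.1):
    """InfoNCE for one anchor: the positive pair also appears in the denominator."""
    pos = math.exp(sim_pos / tau)
    neg = sum(math.exp(s / tau) for s in sim_negs)
    return -math.log(pos / (pos + neg))

def info_loob(sim_pos, sim_negs, tau=0.1):
    """InfoLOOB for one anchor: the denominator holds only the negatives."""
    neg = sum(math.exp(s / tau) for s in sim_negs)
    return -(sim_pos / tau) + math.log(neg)

def slope_wrt_positive(loss, sim_pos, sim_negs, eps=1e-5):
    """Central finite-difference estimate of d(loss)/d(sim_pos)."""
    return (loss(sim_pos + eps, sim_negs) - loss(sim_pos - eps, sim_negs)) / (2 * eps)

# Hypothetical cosine similarities: one well-matched pair and three negatives.
sim_pos, sim_negs = 0.9, [0.1, 0.0, -0.2]

nce_slope = slope_wrt_positive(info_nce, sim_pos, sim_negs)
loob_slope = slope_wrt_positive(info_loob, sim_pos, sim_negs)
print(f"InfoNCE  loss={info_nce(sim_pos, sim_negs):.4f}  slope={nce_slope:.4f}")
print(f"InfoLOOB loss={info_loob(sim_pos, sim_negs):.4f}  slope={loob_slope:.4f}")
```

Once the positive similarity dominates, the InfoNCE slope with respect to the positive similarity shrinks toward zero, so the pair contributes almost no learning signal. The InfoLOOB slope stays at -1/tau regardless of how well the pair is already matched.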


CLOOB: Modern Hopfield Networks with InfoLOOB Outperform CLIP

#artificialintelligence

Similar to the associative memory of humans, our approach uses associative memories to amplify co-occurrences and the covariance structure. The associative memory of our choice is a modern Hopfield network because of its fast retrieval and high storage capacity, as shown in "Hopfield Networks is All You Need". The update mechanism of modern Hopfield networks is equivalent to the self-attention mechanism of Transformer networks. However, modern Hopfield networks are more general and have a broader functionality, of which the Transformer self-attention is just one example. The corresponding Hopfield layers can be built into deep learning architectures for associating two sets, encoder-decoder attention, multiple instance learning, or averaging and pooling operations.
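The link between Hopfield retrieval and attention can be sketched as follows. This is a minimal illustration, not the Hopfield layer from the paper: it performs one retrieval step in which the stored patterns act as both keys and values and the state pattern acts as the query. The function names, the inverse temperature `beta=8.0`, and the toy patterns are assumptions.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def hopfield_retrieve(stored, query, beta=8.0):
    """One retrieval step: a softmax-weighted average of the stored patterns.

    The weights are softmax(beta * <stored_i, query>), which has the same form
    as attention with keys = values = stored patterns.
    """
    scores = [beta * sum(x_k * q_k for x_k, q_k in zip(x, query)) for x in stored]
    weights = softmax(scores)
    dim = len(query)
    return [sum(w * x[k] for w, x in zip(weights, stored)) for k in range(dim)]

# Toy memory of three one-hot patterns and a noisy query near the first one.
stored = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
query = [0.9, 0.1, 0.0]

retrieved = hopfield_retrieve(stored, query)
print([round(v, 4) for v in retrieved])
```

With a large `beta` the retrieval snaps to the closest stored pattern. With a small `beta` it returns a blend of several patterns, which is how co-occurring features in the memory can leak into the retrieved embedding.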