Quick-fire Guide to Multi-Modal ML With OpenAI's CLIP


Contrastive Language-Image Pretraining (CLIP) consists of two models, an image encoder and a text encoder, trained in parallel. During training, (image, text) pairs are fed into the respective models, and each outputs a 512-dimensional vector embedding that represents its image or text in vector space. The contrastive component compares these two embeddings and calculates the model loss from the difference (i.e., the contrast) between them. Both models are then optimized to minimize this difference for matching pairs, while mismatched pairs from the same batch are pushed apart, so the models learn to embed matching (image, text) pairs close together in a shared vector space. After this contrastive pretraining process, we are left with CLIP, a multi-modal model capable of understanding both language and images via that shared vector space.
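To make the contrastive step concrete, here is a minimal sketch of a symmetric contrastive loss over a small batch, written in plain Python. It is an illustration under stated assumptions, not CLIP's actual training code: the function names (`l2_normalize`, `cross_entropy_diagonal`, `contrastive_loss`), the fixed `temperature` of 0.07, and the 3-dimensional toy embeddings are all hypothetical stand-ins for real 512-dimensional encoder outputs.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; temperature is an assumed fixed value here."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    # logits[i][j]: scaled cosine similarity between image i and text j
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    # Transpose to score each text against every image
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))

# Toy embeddings (assumed values): image i is meant to match text i
image_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
text_embs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]]

loss_aligned = contrastive_loss(image_embs, text_embs)
loss_shuffled = contrastive_loss(image_embs, text_embs[::-1])
print(loss_aligned, loss_shuffled)
```

In this sketch the correctly paired batch yields a much lower loss than the batch with its captions swapped, which is the signal that pulls matching pairs together and pushes mismatched pairs apart during training.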
