Quick-fire Guide to Multi-Modal ML With OpenAI's CLIP

Aug-12-2022, 00:20:22 GMT - #artificialintelligence

Contrastive Language-Image Pretraining (CLIP) consists of two models, an image encoder and a text encoder, trained in parallel. During training, (image, text) pairs are fed into the respective models, and each outputs a 512-dimensional vector embedding that represents its image or text in vector space. The contrastive component takes these two embeddings and calculates the model loss from the difference (i.e., the contrast) between them. Both models are then optimized to minimize this difference for matching pairs, and therefore learn to embed similar (image, text) pairs close together in a shared vector space. After this contrastive pretraining process, we are left with CLIP, a multi-modal model capable of understanding both language and images via that shared vector space.
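To make the contrastive step concrete, here is a minimal sketch in plain Python. It is not taken from the article: it assumes the symmetric cross-entropy over cosine similarities described in the CLIP paper, uses toy 3-dimensional vectors in place of real 512-dimensional encoder outputs, and all function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cross_entropy(logits, target_index):
    """Negative log-softmax of the target entry, computed stably."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum_exp - logits[target_index]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    Pair k is (image_embs[k], text_embs[k]). The temperature value is an
    assumption for this sketch; in CLIP it is a learned parameter.
    """
    n = len(image_embs)
    # logits[i][j]: scaled similarity between image i and text j
    logits = [
        [cosine_similarity(img, txt) / temperature for txt in text_embs]
        for img in image_embs
    ]
    # Each image should pick out its own text (rows) ...
    loss_images = sum(cross_entropy(logits[k], k) for k in range(n)) / n
    # ... and each text should pick out its own image (columns).
    loss_texts = sum(
        cross_entropy([logits[r][k] for r in range(n)], k) for k in range(n)
    ) / n
    return (loss_images + loss_texts) / 2


# Toy stand-ins for encoder outputs (made-up values, not real CLIP embeddings).
images = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
texts_aligned = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]]
texts_swapped = [texts_aligned[1], texts_aligned[0]]

aligned_loss = contrastive_loss(images, texts_aligned)
swapped_loss = contrastive_loss(images, texts_swapped)
print(aligned_loss < swapped_loss)
```

When each image sits next to its own caption in the shared space, the loss is low; swapping the captions so that pairs no longer match raises it. Minimizing this quantity is what pushes both encoders toward a common embedding space.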


