Review -- FLAVA: A Foundational Language And Vision Alignment Model

The image-text contrastive loss resembles that of CLIP. Given a batch of images and texts, the cosine similarities of matched image-text pairs are maximized while those of unmatched pairs are minimized. The paper reports a noticeable performance gain from performing full backpropagation across GPUs when computing this loss, which is why it is called the Global Contrastive (GC) loss.

Given an image and text input, the input image patches are first tokenized using a pretrained dVAE tokenizer, as in DALL·E, which maps each image patch to an index in a visual codebook, similar to a word dictionary.
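To make the contrastive objective described above more concrete, here is a minimal single-process sketch of a CLIP-style symmetric contrastive loss in plain Python. It is not the paper's implementation: the function names, the temperature value of 0.07, and the toy embeddings are assumptions for illustration, and the cross-GPU gathering and backpropagation that make the loss "global" are not shown.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric image-text contrastive loss (illustrative sketch).

    image_emb[i] and text_emb[i] are assumed to be a matched pair;
    every other combination in the batch is treated as unmatched.
    The temperature default is an assumption, not a value from the review.
    """
    imgs = [l2_normalize(v) for v in image_emb]
    txts = [l2_normalize(v) for v in text_emb]
    n = len(imgs)

    # logits[i][j] = cosine similarity of image i and text j, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    # Image-to-text direction: for each image (row), the matched text is on the diagonal.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # Text-to-image direction: same idea over the columns.
    columns = [list(col) for col in zip(*logits)]
    t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n

    return 0.5 * (i2t + t2i)


# Toy batch of two pairs: aligned embeddings should score a lower loss
# than the same embeddings with the texts swapped.
images = [[1.0, 0.0], [0.0, 1.0]]
texts_aligned = [[1.0, 0.0], [0.0, 1.0]]
texts_swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = contrastive_loss(images, texts_aligned)
loss_swapped = contrastive_loss(images, texts_swapped)
```

Minimizing this quantity pushes the diagonal (matched) similarities up and the off-diagonal (unmatched) ones down. In a multi-GPU setting, the embeddings from all devices would first be gathered into one larger batch before the similarity matrix is built; how gradients flow back through that gather is the detail the review attributes the performance gain to.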