CLIP-GEN Overview

#artificialintelligence 

Training a text-to-image generator in the general domain, such as DALL-E, GauGAN, and CogView, requires huge amounts of paired text-image data, which is difficult and expensive to collect. In this paper, the authors propose a self-supervised scheme named CLIP-GEN for general text-to-image generation that relies on the language-image priors extracted by a pre-trained CLIP model. Only a set of unlabeled images in the general domain is required to train the text-to-image generator. First, the embedding of an image in the unified language-vision embedding space is extracted with the CLIP image encoder. Next, the image is converted into a sequence of discrete tokens in the VQGAN codebook space (the VQGAN itself can be trained on unlabeled images).
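The two preprocessing steps described above can be illustrated with a minimal sketch. It assumes the open-source `clip` package (github.com/openai/CLIP) for the embedding step, and uses a toy random codebook with nearest-neighbour lookup as a stand-in for a trained VQGAN encoder; names such as `encode_to_tokens`, `toy_codebook`, and `example.jpg` are illustrative and not from the paper.

```python
# Sketch: (1) embed an image with a frozen CLIP encoder, then
# (2) quantize a latent feature map against a codebook to get discrete tokens.
import torch
import clip                      # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# Step 1: CLIP image embedding in the joint language-vision space.
clip_model, preprocess = clip.load("ViT-B/32", device=device)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    image_embedding = clip_model.encode_image(image)     # shape: (1, 512)

# Step 2 (toy stand-in for a VQGAN encoder): map each continuous latent
# vector to the index of its nearest codebook entry, yielding token ids.
def encode_to_tokens(latents: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """latents: (N, D) vectors; codebook: (K, D) entries -> (N,) token indices."""
    distances = torch.cdist(latents, codebook)            # (N, K) pairwise distances
    return distances.argmin(dim=-1)

# A real VQGAN encoder would produce these latents; here they are random.
latent_grid = torch.randn(16 * 16, 256)                   # 16x16 spatial positions
toy_codebook = torch.randn(1024, 256)                     # 1024-entry codebook
tokens = encode_to_tokens(latent_grid, toy_codebook)      # discrete token sequence

print(image_embedding.shape, tokens.shape)
```

In the actual method, the CLIP embedding conditions an autoregressive transformer that generates the VQGAN token sequence, so at inference time a text embedding from the same CLIP space can replace the image embedding.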
