CogView: Mastering Text-to-Image Generation via Transformers
Text-to-Image generation in the general domain has long been an open problem, which requires both a powerful generative model and cross-modal understanding. We propose CogView, a 4-billion-parameter Transformer with a VQ-VAE tokenizer to advance this problem. We also demonstrate finetuning strategies for various downstream tasks. CogView achieves the state-of-the-art FID on the blurred MS COCO dataset, outperforming previous GAN-based models and a recent similar work, DALL-E.
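As a rough illustration of the setup the abstract describes, the sketch below builds a toy version in PyTorch: discrete image codes from a tokenizer and text tokens are concatenated into one sequence, and a single causal Transformer models it left to right, so image codes are predicted conditioned on the text. All vocabulary sizes, sequence lengths, and layer counts here are illustrative placeholders, not CogView's actual configuration.

```python
# Toy sketch of a text-plus-image-token autoregressive model (not CogView's code).
import torch
import torch.nn as nn

TEXT_VOCAB, IMAGE_VOCAB = 50_000, 8192   # illustrative vocabulary sizes
TEXT_LEN, IMAGE_LEN = 64, 16 * 16        # toy 16x16 code grid; the real grid is larger
D_MODEL = 512                            # toy width; CogView itself has ~4B parameters

class ToyTextToImageLM(nn.Module):
    def __init__(self):
        super().__init__()
        # Text tokens and image codes share one table; image ids are offset by TEXT_VOCAB.
        self.embed = nn.Embedding(TEXT_VOCAB + IMAGE_VOCAB, D_MODEL)
        self.pos = nn.Embedding(TEXT_LEN + IMAGE_LEN, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(D_MODEL, TEXT_VOCAB + IMAGE_VOCAB)

    def forward(self, tokens):
        # tokens: (batch, TEXT_LEN + IMAGE_LEN); the causal mask enforces
        # left-to-right generation, so image codes depend on the preceding text.
        n = tokens.size(1)
        mask = torch.triu(torch.full((n, n), float("-inf")), diagonal=1)
        x = self.embed(tokens) + self.pos(torch.arange(n, device=tokens.device))
        return self.head(self.blocks(x, mask=mask))

if __name__ == "__main__":
    model = ToyTextToImageLM()
    text = torch.randint(0, TEXT_VOCAB, (2, TEXT_LEN))
    image = torch.randint(TEXT_VOCAB, TEXT_VOCAB + IMAGE_VOCAB, (2, IMAGE_LEN))
    logits = model(torch.cat([text, image], dim=1))
    print(logits.shape)  # (2, 320, 58192): next-token logits over the joint vocabulary
```

At generation time, such a model would be given only the text tokens and would sample the image codes one by one, which are then decoded back to pixels by the image tokenizer's decoder.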
A Data Collection and Details about the Tokenizers
We collected about 30 million text-image pairs from multiple channels and built a new 2.5 TB dataset (after tokenization, the size shrinks to about 250 GB). The data sources fall into the following categories: (1) professional image websites (both English and Chinese), where the images usually come with captions. The tokenizers were already introduced in Section 2.2; further details are given here.
[Figure caption: colored grids mark all the tokens attended to by the token marked "O".]
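The sketch below illustrates the core step of such an image tokenizer, assuming a generic VQ-VAE setup rather than the paper's exact model: each spatial feature vector produced by the encoder is replaced by the index of its nearest codebook entry, so an image is stored as a small grid of integers, which is why the tokenized dataset is roughly an order of magnitude smaller than the raw data. The codebook size and feature shapes are assumptions for illustration.

```python
# Minimal vector-quantization step of a VQ-VAE image tokenizer (assumed shapes).
import torch

def quantize(features: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """features: (B, H, W, D) encoder outputs; codebook: (K, D) learned codes.
    Returns (B, H, W) integer token ids."""
    flat = features.reshape(-1, features.shape[-1])   # (B*H*W, D)
    dists = torch.cdist(flat, codebook)               # Euclidean distance to every code, (B*H*W, K)
    ids = dists.argmin(dim=-1)                        # index of the nearest codebook entry
    return ids.reshape(features.shape[:-1])

if __name__ == "__main__":
    codebook = torch.randn(8192, 256)        # hypothetical 8192-entry codebook
    feats = torch.randn(1, 32, 32, 256)      # encoder output for one image
    tokens = quantize(feats, codebook)
    print(tokens.shape, tokens.dtype)        # torch.Size([1, 32, 32]) torch.int64
```

Stored as small integers, such a 32x32 grid occupies a few kilobytes per image, consistent with the order-of-magnitude shrinkage from 2.5 TB of raw pairs to roughly 250 GB of tokens reported above.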
CLIP-GEN Overview
Training a text-to-image generator in the general domain like DALL-E, GauGAN, and CogView requires huge amounts of paired text-image data, which can be problematic and expensive to collect. In this paper, the authors propose a self-supervised scheme named CLIP-GEN for general text-to-image generation using the language-image priors extracted by a pre-trained CLIP model. Only a set of unlabeled images in the general domain is required to train the text-to-image generator. First, the image's embedding in the joint language-vision embedding space is extracted with the CLIP image encoder. Next, the image is converted into a sequence of discrete tokens in the VQGAN codebook space (the VQGAN can be trained on unlabeled data). Finally, an autoregressive transformer is trained to generate these image tokens conditioned on the CLIP embedding; at inference time, the text embedding produced by CLIP's text encoder replaces the image embedding, so images can be generated from text without any paired training data.
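The sketch below outlines this training loop with stand-in modules (toy encoders and hypothetical shapes, not the released CLIP-GEN code or real CLIP/VQGAN weights): the CLIP image embedding serves as the conditioning signal, the VQGAN token sequence as the target, and an autoregressive model is trained to predict the tokens from the condition. At inference, the CLIP text embedding would take the image embedding's place.

```python
# Sketch of the self-supervised conditioning scheme described above (stand-in modules).
import torch
import torch.nn as nn

EMBED_DIM, VOCAB, SEQ_LEN, D_MODEL = 512, 1024, 256, 256  # illustrative sizes

# Stand-in for a frozen, pre-trained CLIP image encoder.
clip_image_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, EMBED_DIM))

def vqgan_tokenizer(imgs: torch.Tensor) -> torch.Tensor:
    """Stand-in for a frozen VQGAN: returns token ids with the expected shape."""
    return torch.randint(0, VOCAB, (imgs.size(0), SEQ_LEN))

class ConditionalGenerator(nn.Module):
    """Autoregressive token model conditioned on a CLIP-space embedding."""
    def __init__(self):
        super().__init__()
        self.cond_proj = nn.Linear(EMBED_DIM, D_MODEL)   # condition becomes a prefix token
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(D_MODEL, VOCAB)

    def forward(self, condition, tokens):
        prefix = self.cond_proj(condition).unsqueeze(1)            # (B, 1, D)
        x = torch.cat([prefix, self.embed(tokens[:, :-1])], dim=1)
        n = x.size(1)
        mask = torch.triu(torch.full((n, n), float("-inf")), diagonal=1)
        return self.head(self.blocks(x, mask=mask))                # logits for each target token

if __name__ == "__main__":
    images = torch.rand(2, 3, 64, 64)                    # unlabeled training images
    with torch.no_grad():
        condition = clip_image_encoder(images)           # CLIP image embedding (condition)
        target = vqgan_tokenizer(images)                  # VQGAN token ids (target)
    model = ConditionalGenerator()
    logits = model(condition, target)
    loss = nn.functional.cross_entropy(logits.reshape(-1, VOCAB), target.reshape(-1))
    print(loss.item())
```

Because CLIP places text and images in the same embedding space, swapping in the text encoder's output at test time requires no change to the generator, which is what lets the scheme train on unlabeled images alone.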