CogView: Mastering Text-to-Image Generation via Transformers
Text-to-Image generation in the general domain has long been an open problem, which requires both a powerful generative model and cross-modal understanding. We propose CogView, a 4-billion-parameter Transformer with a VQ-VAE tokenizer to advance this problem. We also demonstrate finetuning strategies for various downstream tasks. CogView achieves the state-of-the-art FID on the blurred MS COCO dataset, outperforming previous GAN-based models and a recent similar work, DALL-E.
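As a rough illustration of the setup the abstract describes, the sketch below builds a toy version in PyTorch: discrete image codes from a tokenizer and text tokens are concatenated into one sequence, and a single causal Transformer models it left to right, so image codes are predicted conditioned on the text. All vocabulary sizes, sequence lengths, and layer counts here are illustrative placeholders, not CogView's actual configuration.

```python
# Toy sketch of a text-plus-image-token autoregressive model (not CogView's code).
import torch
import torch.nn as nn

TEXT_VOCAB, IMAGE_VOCAB = 50_000, 8192   # illustrative vocabulary sizes
TEXT_LEN, IMAGE_LEN = 64, 16 * 16        # toy 16x16 code grid; the real grid is larger
D_MODEL = 512                            # toy width; CogView itself has ~4B parameters

class ToyTextToImageLM(nn.Module):
    def __init__(self):
        super().__init__()
        # Text tokens and image codes share one table; image ids are offset by TEXT_VOCAB.
        self.embed = nn.Embedding(TEXT_VOCAB + IMAGE_VOCAB, D_MODEL)
        self.pos = nn.Embedding(TEXT_LEN + IMAGE_LEN, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(D_MODEL, TEXT_VOCAB + IMAGE_VOCAB)

    def forward(self, tokens):
        # tokens: (batch, TEXT_LEN + IMAGE_LEN); the causal mask enforces
        # left-to-right generation, so image codes depend on the preceding text.
        n = tokens.size(1)
        mask = torch.triu(torch.full((n, n), float("-inf")), diagonal=1)
        x = self.embed(tokens) + self.pos(torch.arange(n, device=tokens.device))
        return self.head(self.blocks(x, mask=mask))

if __name__ == "__main__":
    model = ToyTextToImageLM()
    text = torch.randint(0, TEXT_VOCAB, (2, TEXT_LEN))
    image = torch.randint(TEXT_VOCAB, TEXT_VOCAB + IMAGE_VOCAB, (2, IMAGE_LEN))
    logits = model(torch.cat([text, image], dim=1))
    print(logits.shape)  # (2, 320, 58192): next-token logits over the joint vocabulary
```

At generation time, such a model would be given only the text tokens and would sample the image codes one by one, which are then decoded back to pixels by the image tokenizer's decoder.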
A Data Collection and Details about the Tokenizers
We collected about 30 million text-image pairs from multiple channels and built a new 2.5 TB dataset (after tokenization, the size shrinks to about 250 GB). The data sources fall into the following categories: (1) professional image websites (both English and Chinese), where the images usually come with captions. The tokenizers were already introduced in Section 2.2; further details are given here.
[Figure caption: colored grids mark all the tokens attended to by the token marked "O".]
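The sketch below illustrates the core step of such an image tokenizer, assuming a generic VQ-VAE setup rather than the paper's exact model: each spatial feature vector produced by the encoder is replaced by the index of its nearest codebook entry, so an image is stored as a small grid of integers, which is why the tokenized dataset is roughly an order of magnitude smaller than the raw data. The codebook size and feature shapes are assumptions for illustration.

```python
# Minimal vector-quantization step of a VQ-VAE image tokenizer (assumed shapes).
import torch

def quantize(features: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """features: (B, H, W, D) encoder outputs; codebook: (K, D) learned codes.
    Returns (B, H, W) integer token ids."""
    flat = features.reshape(-1, features.shape[-1])   # (B*H*W, D)
    dists = torch.cdist(flat, codebook)               # Euclidean distance to every code, (B*H*W, K)
    ids = dists.argmin(dim=-1)                        # index of the nearest codebook entry
    return ids.reshape(features.shape[:-1])

if __name__ == "__main__":
    codebook = torch.randn(8192, 256)        # hypothetical 8192-entry codebook
    feats = torch.randn(1, 32, 32, 256)      # encoder output for one image
    tokens = quantize(feats, codebook)
    print(tokens.shape, tokens.dtype)        # torch.Size([1, 32, 32]) torch.int64
```

Stored as small integers, such a 32x32 grid occupies a few kilobytes per image, consistent with the order-of-magnitude shrinkage from 2.5 TB of raw pairs to roughly 250 GB of tokens reported above.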
CLIP-GEN Overview
Training a text-to-image generator in the general domain like DALL-E, GauGAN, and CogView requires huge amounts of paired text-image data, which can be problematic and expensive to collect. In this paper, the authors propose a self-supervised scheme named CLIP-GEN for general text-to-image generation using the language-image priors extracted by a pre-trained CLIP model. Only a set of unlabeled images in the general domain is required to train the text-to-image generator. First, the image's embedding in the joint language-vision embedding space is extracted with the CLIP image encoder. Next, the image is converted into a sequence of discrete tokens in the VQGAN codebook space (the VQGAN can be trained on unlabeled data). Finally, an autoregressive transformer is trained to generate these image tokens conditioned on the CLIP embedding; at inference time, the text embedding produced by CLIP's text encoder replaces the image embedding, so images can be generated from text without any paired training data.
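The sketch below outlines this training loop with stand-in modules (toy encoders and hypothetical shapes, not the released CLIP-GEN code or real CLIP/VQGAN weights): the CLIP image embedding serves as the conditioning signal, the VQGAN token sequence as the target, and an autoregressive model is trained to predict the tokens from the condition. At inference, the CLIP text embedding would take the image embedding's place.

```python
# Sketch of the self-supervised conditioning scheme described above (stand-in modules).
import torch
import torch.nn as nn

EMBED_DIM, VOCAB, SEQ_LEN, D_MODEL = 512, 1024, 256, 256  # illustrative sizes

# Stand-in for a frozen, pre-trained CLIP image encoder.
clip_image_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, EMBED_DIM))

def vqgan_tokenizer(imgs: torch.Tensor) -> torch.Tensor:
    """Stand-in for a frozen VQGAN: returns token ids with the expected shape."""
    return torch.randint(0, VOCAB, (imgs.size(0), SEQ_LEN))

class ConditionalGenerator(nn.Module):
    """Autoregressive token model conditioned on a CLIP-space embedding."""
    def __init__(self):
        super().__init__()
        self.cond_proj = nn.Linear(EMBED_DIM, D_MODEL)   # condition becomes a prefix token
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(D_MODEL, VOCAB)

    def forward(self, condition, tokens):
        prefix = self.cond_proj(condition).unsqueeze(1)            # (B, 1, D)
        x = torch.cat([prefix, self.embed(tokens[:, :-1])], dim=1)
        n = x.size(1)
        mask = torch.triu(torch.full((n, n), float("-inf")), diagonal=1)
        return self.head(self.blocks(x, mask=mask))                # logits for each target token

if __name__ == "__main__":
    images = torch.rand(2, 3, 64, 64)                    # unlabeled training images
    with torch.no_grad():
        condition = clip_image_encoder(images)           # CLIP image embedding (condition)
        target = vqgan_tokenizer(images)                  # VQGAN token ids (target)
    model = ConditionalGenerator()
    logits = model(condition, target)
    loss = nn.functional.cross_entropy(logits.reshape(-1, VOCAB), target.reshape(-1))
    print(loss.item())
```

Because CLIP places text and images in the same embedding space, swapping in the text encoder's output at test time requires no change to the generator, which is what lets the scheme train on unlabeled images alone.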