Pre-trained Language Models Do Not Help Auto-regressive Text-to-Image Generation

Zhang, Yuhui; McKinzie, Brandon; Gan, Zhe; Shankar, Vaishaal; Toshev, Alexander

Nov-27-2023, arXiv.org Artificial Intelligence

Recent advances in image tokenizers, such as VQ-VAE, have enabled text-to-image generation using auto-regressive methods, similar to language modeling. However, these methods have yet to leverage pre-trained language models, despite their adaptability to various downstream tasks. In this work, we explore this gap by adapting a pre-trained language model for auto-regressive text-to-image generation, and find that pre-trained language models offer limited help. We provide a two-fold explanation by analyzing tokens from each modality. First, we demonstrate that image tokens possess significantly different semantics compared to text tokens, rendering pre-trained language models no more effective in modeling them than randomly initialized ones. Second, the text tokens in the image-text datasets are too simple compared to normal language model pre-training data, which causes the catastrophic degradation of language models' capability.
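
To make the setup concrete, the sketch below shows one common way a caption and a tokenized image can be packed into a single sequence for next-token prediction. It is illustrative only and not taken from the paper: the function name `build_joint_sequence`, the vocabulary-offset scheme, and the per-modality mask are all assumptions.

```python
from typing import List, Tuple


def build_joint_sequence(
    text_ids: List[int],
    image_codes: List[int],
    text_vocab_size: int,
) -> Tuple[List[int], List[int], List[bool]]:
    """Pack caption tokens and image tokens into one auto-regressive example.

    Assumption: image codebook indices (e.g. from a VQ-VAE-style tokenizer)
    are shifted by `text_vocab_size` so that the two vocabularies do not
    collide inside a single embedding table.
    """
    # Shift image codes into their own id range above the text vocabulary.
    image_ids = [text_vocab_size + code for code in image_codes]

    # Text comes first, so every image token is conditioned on the caption.
    sequence = list(text_ids) + image_ids

    # Standard next-token prediction: predict token t+1 from tokens <= t.
    inputs = sequence[:-1]
    targets = sequence[1:]

    # Mark which targets are image tokens, so that text loss and image loss
    # can be tracked separately (hypothetical bookkeeping for analysis).
    is_image_target = [target >= text_vocab_size for target in targets]
    return inputs, targets, is_image_target


if __name__ == "__main__":
    inputs, targets, is_image_target = build_joint_sequence(
        text_ids=[5, 6, 7], image_codes=[0, 3], text_vocab_size=100
    )
    print(inputs, targets, is_image_target)
```

Under this scheme, a pre-trained language model would need its embedding table extended to cover the shifted image ids, while a randomly initialized baseline would simply be built with the larger vocabulary from the start. Keeping a per-modality mask makes it possible to compare the two models on text tokens and image tokens separately.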

image token, language model, text token

Country:
- North America > United States
  - California > Santa Clara County > Palo Alto (0.04)
- Asia > Middle East
  - Jordan (0.04)

Genre:
- Research Report (0.82)

Technology:
- Information Technology > Artificial Intelligence
  - Vision (1.00)
  - Natural Language (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (0.94)