Google's Imagen Text-to-Image Diffusion Model With Deep Language Understanding Defeats DALL-E 2
Text-to-image diffusion models that can generate and edit photorealistic images have become a hot AI research area, and their striking synthetic images have drawn widespread mainstream media coverage. Diffusion models have surpassed previous high-performance methods such as GANs (generative adversarial networks) in both image fidelity and diversity, and are now demonstrating their potential in text-to-image generation.

In the new paper Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding, a Google Brain research team advances the field with Imagen, a text-to-image diffusion model that combines the deep language understanding of transformer-based large language models with the photorealistic image generation capabilities of diffusion models. Imagen achieves a new state-of-the-art zero-shot FID score of 7.27 on the COCO dataset.

Imagen's training data was drawn from massive datasets of image and English alt-text pairs. As with previous text-to-image models, Imagen's "wow" factor lies in its ability to generate photorealistic, high-resolution images from fanciful prompts such as "A cute corgi lives in a house made out of sushi" or "A dragon fruit wearing a karate belt in the snow."
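For readers unfamiliar with the FID (Fréchet Inception Distance) metric quoted above, here is a minimal sketch of how such a score is computed. It is illustrative only and is not the paper's evaluation code. FID fits a Gaussian to feature vectors of real images and another to feature vectors of generated images, then measures the Fréchet distance between the two; lower is better. In practice the features come from a pretrained Inception-v3 network. The sketch assumes the features have already been extracted, and the function names `frechet_distance` and `fid_from_features` are hypothetical.

```python
import numpy as np


def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Fréchet distance between Gaussians N(mu1, sigma1) and N(mu2, sigma2).

    d^2 = ||mu1 - mu2||^2 + Tr(sigma1) + Tr(sigma2) - 2 Tr((sigma1 sigma2)^(1/2))
    """
    diff = mu1 - mu2
    # Tr((sigma1 sigma2)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of sigma1 @ sigma2, which are real and non-negative for
    # covariance matrices.
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    # Numerical noise can leave tiny negative or imaginary parts; drop them.
    sqrt_trace = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * sqrt_trace)


def fid_from_features(real_feats, gen_feats):
    """FID-style score from two (num_images, feature_dim) arrays.

    Assumption: the rows are precomputed image features (Inception-v3
    activations in a real evaluation); any feature array works for this sketch.
    """
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    sigma_r = np.cov(real_feats, rowvar=False)
    sigma_g = np.cov(gen_feats, rowvar=False)
    return frechet_distance(mu_r, sigma_r, mu_g, sigma_g)
```

Passing the same feature array for both arguments gives a distance of approximately zero, and the score grows as the means or covariances of the two sets drift apart.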