OpenAI and the road to text-guided image generation: DALL·E, CLIP, GLIDE, DALL·E 2 (unCLIP)


The first version of DALL·E was a GPT-3-style transformer decoder that autoregressively generated a 256×256 image based on textual input and an optional beginning of the image. If you want to understand how a GPT-like transformer works, here is a great visual explanation by Jay Alammar. The text is encoded with BPE tokens (max. 256), and the image is encoded as a 32×32 grid of 1024 image tokens produced by a discrete VAE (dVAE). Because of the dVAE, some details and high-frequency features are lost in generated images, so a certain blurriness and smoothness are characteristic of DALL·E's output. The transformer itself is a large model with 12B parameters. It consists of 64 sparse transformer blocks with a complicated set of attention mechanisms inside: 1) classical text-to-text masked attention, 2) image-to-text attention, and 3) image-to-image sparse attention.
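The generation loop above can be sketched in a few lines: the text tokens and any partial image tokens form one sequence, and the model samples image tokens one at a time until the 32×32 grid is full. This is a minimal illustration, not OpenAI's implementation; `generate_image_tokens` and `dummy_next_token` are hypothetical names, and the stand-in sampler replaces the real 12B-parameter transformer's forward pass.

```python
import random

TEXT_LEN = 256      # max BPE text tokens
IMAGE_LEN = 1024    # 32x32 grid of dVAE image tokens
IMAGE_VOCAB = 8192  # dVAE codebook size

def dummy_next_token(sequence):
    # Stand-in for the transformer: the real model runs a forward pass
    # over the whole sequence and samples from the predicted distribution.
    random.seed(len(sequence))  # deterministic for this sketch
    return random.randrange(IMAGE_VOCAB)

def generate_image_tokens(text_tokens, partial_image=()):
    """Autoregressively complete the 1024 image tokens given the text
    (and, optionally, the beginning of an image)."""
    assert len(text_tokens) <= TEXT_LEN
    image_tokens = list(partial_image)
    sequence = list(text_tokens) + image_tokens
    while len(image_tokens) < IMAGE_LEN:
        nxt = dummy_next_token(sequence)  # sample the next image token
        image_tokens.append(nxt)
        sequence.append(nxt)
    # In DALL·E, these tokens are then decoded by the dVAE into a 256x256 image.
    return image_tokens
```

Conditioning on a partial image is just seeding `partial_image` with its tokens; the loop then fills in the rest, which is how DALL·E completes an optional image prefix.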
