DALL·E is a 12-billion parameter version of GPT-3 trained to generate images from text descriptions, using a dataset of text–image pairs. We've found that it has a diverse set of capabilities, including creating anthropomorphized versions of animals and objects, combining unrelated concepts in plausible ways, rendering text, and applying transformations to existing images. GPT-3 showed that language can be used to instruct a large neural network to perform a variety of text generation tasks. Image GPT showed that the same type of neural network can also be used to generate images with high fidelity. We extend these findings to show that manipulating visual concepts through language is now within reach.
Generating images from textual descriptions has gained a lot of attention. Recently, DALL-E, a multimodal transformer language model, and its variants have shown high-quality text-to-image generation capabilities with a simple architecture and training objective, powered by large-scale training data and computation. However, despite the interesting image generation results, there has not been a detailed analysis on how to evaluate such models. In this work, we investigate the reasoning capabilities and social biases of such text-to-image generative transformers in detail. First, we measure four visual reasoning skills: object recognition, object counting, color recognition, and spatial relation understanding. For this, we propose PaintSkills, a diagnostic dataset and evaluation toolkit that measures these four visual reasoning skills. Second, we measure the text alignment and quality of the generated images based on pretrained image captioning, image-text retrieval, and image classification models. Third, we assess social biases in the models. For this, we suggest evaluation of gender and racial biases of text-to-image generation models based on a pretrained image-text retrieval model and human evaluation. In our experiments, we show that recent text-to-image models perform better in recognizing and counting objects than recognizing colors and understanding spatial relations, while there exists a large gap between model performances and oracle accuracy on all skills. Next, we demonstrate that recent text-to-image models learn specific gender/racial biases from web image-text pairs. We also show that our automatic evaluations of visual reasoning skills and gender bias are highly correlated with human judgments. We hope our work will help guide future progress in improving text-to-image models on visual reasoning skills and social biases. Code and data at: https://github.com/j-min/DallEval
In this special guest feature, Sahar Mor, founder of AirPaper, discusses DALL-E – a new powerful API from OpenAI that creates images from text captions. With this, Sahar is planning to build a few products such as a chart generator based on text and a text-based tool to generate illustrations for landing pages. Sahar has 12 years of Engineering Product Management experience, both focused on products with AI in their core. Previously he worked as an Engineering Manager in early-stage startups and at the elite Israeli intelligence unit – 8200. Several months ago OpenAI published their latest research model DALL-E – an advanced neural network that generates images from text prompts and a natural progression of its powerful language model GPT-3.
I explain Artificial Intelligence terms and news to non-experts. Dalle mini is amazing -- and YOU can use it! I'm sure you've seen pictures like those in your Twitter feed in the past few days. If you wondered what they were, they are images generated by an AI called DALL·E mini. If you've never seen those, you need to watch this video because you are missing out.
It is NOT a great time for OpenAI right now. It's been just over a month since DALL·E 2 was released and just a few days ago, Google decides to enter the ring with Imagen. In comparison, Imagen is a slap in the face for DALLE·2 mainly because it outperforms DALLE·2 in terms of AI Image generation precision and quality. If by now you are wondering WTF is ImageGen and DALLE·2, how does this technology work in simple lingo, as well as what makes Google's Imagen so superior then take a walk with me through this article as we shall discover together my amigo. So both technologies in simple terms, allow you to generate images from text.