Collaborating Authors

This avocado armchair could be the future of AI

MIT Technology Review

For all GPT-3's flair, its output can feel untethered from reality, as if it doesn't know what it's talking about. By grounding text in images, researchers at OpenAI and elsewhere are trying to give language models a better grasp of the everyday concepts that humans use to make sense of things. DALL·E and CLIP come at this problem from different directions. At first glance, CLIP (Contrastive Language-Image Pre-training) is yet another image recognition system. Except that it has learned to recognize images not from labeled examples in curated data sets, as most existing models do, but from images and their captions taken from the internet.

DALL·E: Creating Images from Text


DALL·E[1] is a 12-billion parameter version of GPT-3 trained to generate images from text descriptions, using a dataset of text–image pairs. We've found that it has a diverse set of capabilities, including creating anthropomorphized versions of animals and objects, combining unrelated concepts in plausible ways, rendering text, and applying transformations to existing images. GPT-3 showed that language can be used to instruct a large neural network to perform a variety of text generation tasks. Image GPT showed that the same type of neural network can also be used to generate images with high fidelity. We extend these findings to show that manipulating visual concepts through language is now within reach.

DALL·E Explained in Under 5 Minutes


It seems like every few months, someone publishes a machine learning paper or demo that makes my jaw drop. This behemoth 12-billion-parameter neural network takes a text caption (i.e. "an armchair in the shape of an avocado") and generates images to match it: I think its pictures are pretty inspiring (I'd buy one of those avocado chairs), but what's even more impressive is DALL·E's ability to understand and render concepts of space, time, and even logic (more on that in a second). In this post, I'll give you a quick overview of what DALL·E can do, how it works, how it fits in with recent trends in ML, and why it's significant. In July, DALL·E's creator, the company OpenAI, released a similarly huge model called GPT-3 that wowed the world with its ability to generate human-like text, including Op Eds, poems, sonnets, and even computer code.

This AI Could Go From 'Art' to Steering a Self-Driving Car


You've probably never wondered what a knight made of spaghetti would look like, but here's the answer anyway--courtesy of a clever new artificial intelligence program from OpenAI, a company in San Francisco. The program, DALL-E, released earlier this month, can concoct images of all sorts of weird things that don't exist, like avocado armchairs, robot giraffes, or radishes wearing tutus. OpenAI generated several images, including the spaghetti knight, at WIRED's request. DALL-E is a version of GPT-3, an AI model trained on text scraped from the web that's capable of producing surprisingly coherent text. DALL-E was fed images and accompanying descriptions; in response, it can generate a decent mashup image.

AI Image Synthesis: What The Future Holds


Originally published at Ross Dawson. Shortly after the new year 2021, the Media Synthesis community at Reddit began to become more than usually psychedelic. The board became saturated with unearthly images depicting rivers of blood, Picasso's King Kong, a Pikachu chasing Mark Zuckerberg, Synthwave witches, acid-induced kittens, an inter-dimensional portal, the industrial revolution and the possible child of Barack Obama and Donald Trump. The bizarre images were generated by inputting short phrases into Google Colab notebooks (web pages from which a user can access the formidable machine learning resources of the search giant), and letting the trained algorithms compute possible images based on that text. In most cases, the optimal results were obtained in minutes. Various attempts at the same phrase would usually produce wildly different results. In the image synthesis field, this free-ranging facility of invention is something new; not just a bridge between the text and image domains, but an early look at comprehensive AI-driven image generation systems that don't need hyper-specific training in very limited domains (i.e. NVIDIA's landscape generation framework GauGAN [on which, more later], which can turn sketches into landscapes, but only into landscapes; or the various sketch face Pix2Pix projects, that are likewise'specialized'). Example images generated with the Big Sleep Colab notebook [12].