For all GPT-3's flair, its output can feel untethered from reality, as if it doesn't know what it's talking about. By grounding text in images, researchers at OpenAI and elsewhere are trying to give language models a better grasp of the everyday concepts that humans use to make sense of things. DALL·E and CLIP come at this problem from different directions. At first glance, CLIP (Contrastive Language-Image Pre-training) is yet another image recognition system. Except that it has learned to recognize images not from labeled examples in curated data sets, as most existing models do, but from images and their captions taken from the internet.
A neural network uses text captions to create outlandish images – such as armchairs in the shape of avocados – demonstrating it understands how language shapes visual culture. OpenAI, an artificial intelligence company that recently partnered with Microsoft, developed the neural network, which it calls DALL-E. It is a version of the company's GPT-3 language model that can create expansive written works based on short text prompts, but DALL-E produces images instead. "The world isn't just text," says Ilya Sutskever, co-founder of OpenAI. "Humans don't just talk: we also see. A lot of important context comes from looking."
It seems like every few months, someone publishes a machine learning paper or demo that makes my jaw drop. This behemoth 12-billion-parameter neural network takes a text caption (i.e. "an armchair in the shape of an avocado") and generates images to match it: I think its pictures are pretty inspiring (I'd buy one of those avocado chairs), but what's even more impressive is DALL·E's ability to understand and render concepts of space, time, and even logic (more on that in a second). In this post, I'll give you a quick overview of what DALL·E can do, how it works, how it fits in with recent trends in ML, and why it's significant. In July, DALL·E's creator, the company OpenAI, released a similarly huge model called GPT-3 that wowed the world with its ability to generate human-like text, including Op Eds, poems, sonnets, and even computer code.
You've probably never wondered what a knight made of spaghetti would look like, but here's the answer anyway--courtesy of a clever new artificial intelligence program from OpenAI, a company in San Francisco. The program, DALL-E, released earlier this month, can concoct images of all sorts of weird things that don't exist, like avocado armchairs, robot giraffes, or radishes wearing tutus. OpenAI generated several images, including the spaghetti knight, at WIRED's request. DALL-E is a version of GPT-3, an AI model trained on text scraped from the web that's capable of producing surprisingly coherent text. DALL-E was fed images and accompanying descriptions; in response, it can generate a decent mashup image.
When prompted to generate "a mural of a blue pumpkin on the side of a building," OpenAI's new deep ... [ ] learning model DALL-E produces this series of original images. OpenAI has done it again. Earlier this month, OpenAI--the research organization behind last summer's much-hyped language model GPT-3--released a new AI model named DALL-E. While it has generated less buzz than GPT-3 did, DALL-E has even more profound implications for the future of AI. In a nutshell, DALL-E takes text captions as input and produces original images as output. For instance, when fed phrases as diverse as "a pentagonal green clock," "a sphere made of fire" or "a mural of a blue pumpkin on the side of a building," DALL-E is able to generate shockingly accurate visual renderings.