How Imagen Actually Works
While the Machine Learning world was still coming to terms with the impressive results of DALL-E 2, released earlier this year, Google upped the ante by releasing its own text-to-image model, Imagen, which appears to push the boundaries of caption-conditional image generation even further. Imagen, released just last month, can generate high-quality, high-resolution images given only a description of a scene, regardless of how logical or plausible such a scene may be in the real world. These results no doubt have many wondering how Imagen actually works.

In this article, we'll explain how Imagen works at several levels. First, we'll examine Imagen from a bird's-eye view to understand its high-level components and how they relate to one another. We'll then look at each of these components in more detail, each in its own subsection, to understand how they work individually. Finally, we'll perform a Deep Dive into Imagen intended for Machine Learning researchers, students, and practitioners. Without further ado, let's dive in!

In the past few years, significant progress has been made in the text-to-image domain of Machine Learning. A text-to-image model takes in a short textual description of a scene and generates an image that reflects the described scene. An example input description (or "caption") and output image can be seen below:

It is important to note that high-performing text-to-image models will necessarily be able to combine unrelated concepts and objects in semantically plausible ways.
Jul-1-2022, 03:05:21 GMT