rendition
BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing
Subject-driven text-to-image generation models create novel renditions of an input subject based on text prompts. Existing models suffer from lengthy fine-tuning and difficulties preserving the subject fidelity. To overcome these limitations, we introduce BLIP-Diffusion, a new subject-driven image generation model that supports multimodal control which consumes inputs of subject images and text prompts. Unlike other subject-driven generation models, BLIP-Diffusion introduces a new multimodal encoder which is pre-trained to provide subject representation. We first pre-train the multimodal encoder following BLIP-2 to produce visual representation aligned with the text. Then we design a subject representation learning task which enables a diffusion model to leverage such visual representation and generate new subject renditions. Compared with previous methods such as DreamBooth, our model enables zero-shot subject-driven generation, and efficient fine-tuning for customized subjects with up to 20x speedup. We also demonstrate that BLIP-Diffusion can be flexibly combined with existing techniques such as ControlNet and prompt-to-prompt to enable novel subject-driven generation and editing applications.
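As a rough illustration of the mechanism described in the abstract, the sketch below shows one way subject embeddings from a BLIP-2-style multimodal encoder could be appended to the prompt embeddings that serve as the diffusion model's cross-attention context. It is a minimal, hypothetical sketch: the tensor shapes, the `project` layer, and the commented `unet` call are illustrative placeholders, not the released BLIP-Diffusion implementation.

```python
# Hypothetical sketch of subject conditioning: subject embeddings from a
# BLIP-2-style multimodal encoder are projected and appended to the prompt
# embeddings, and the combined sequence is used as cross-attention context
# for the diffusion UNet. All shapes and names here are illustrative.
import torch
import torch.nn as nn

dim = 768                        # shared embedding width (assumed)
project = nn.Linear(dim, dim)    # maps subject tokens into the text-embedding space

# Stand-ins for real encoder outputs:
prompt_tokens  = torch.randn(1, 77, dim)  # text embeddings of "a dog wearing sunglasses"
subject_tokens = torch.randn(1, 16, dim)  # multimodal-encoder output for the reference dog image

# The UNet's cross-attention context now carries both the prompt and the subject:
context = torch.cat([prompt_tokens, project(subject_tokens)], dim=1)  # [1, 93, dim]

# Pre-training would then use the ordinary denoising objective with this enriched context:
#   loss = F.mse_loss(unet(noisy_latents, t, encoder_hidden_states=context), noise)
print(context.shape)
```

Because the subject representation is produced by a pre-trained encoder rather than learned per subject, generation can work zero-shot, and per-subject fine-tuning only needs to refine an already meaningful conditioning signal.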
The Translation of Circumlocution in Arabic Short Stories into English
This study aims at identifying and analyzing circumlocution categories and subcategories in the source language (SL) and their renditions into the target language (TL). It is based on criteria proposed for the inclusion and exclusion of circumlocution. The study is concerned with the translation of literary texts, specifically short stories, from Arabic into English. It draws on four short stories selected from famous Arabic writers and their parallel translations into English. It hypothesizes that Arabic categories of circumlocution are applicable to English categories of metadiscourse, which include textual and interpersonal items. Nida's (1964) model is adopted to judge the appropriateness of the translations. The study shows that the translators made serious decisions while opting for various techniques such as addition, subtraction, and alteration. In this sense, it investigates whether the translators have successfully and appropriately managed to render the concept of Arabic circumlocution into English. The main problems that led to inappropriate translations are also identified. The study concludes that there are many similarities between the categories of circumlocution in Arabic and the categories of metadiscourse in English; these similarities are clear when appropriate renditions are achieved.
- Asia > Middle East > Lebanon > Beirut Governorate > Beirut (0.05)
- Asia > Middle East > Iraq > Nineveh Governorate > Mosul (0.05)
- Africa > Middle East > Egypt > Cairo Governorate > Cairo (0.05)
- (10 more...)
My Imagination Is on Steroids Now
What if The Atlantic owned a train car? Amtrak, I had just learned on the internet, allows owners of private railcars to lash onto runs along the Northeast Corridor, among other routes. "We should have a train car," I slacked an editor. Moments later, it appeared on my screen, bright red with our magazine's logo emblazoned in white, just like I'd ordered. It's an old logo, and misspelled, but the effect was the same: A momentary notion--one unworthy of relating to someone in private, let alone executing--had been realized, thanks to DALL-E 3, an artificial-intelligence image generator now built into Microsoft Bing's Image Creator website.
- Transportation > Ground > Rail (0.76)
- Government > Regional Government > North America Government > United States Government (0.35)
DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation
Ruiz, Nataniel, Li, Yuanzhen, Jampani, Varun, Pritch, Yael, Rubinstein, Michael, Aberman, Kfir
Large text-to-image models achieved a remarkable leap in the evolution of AI, enabling high-quality and diverse synthesis of images from a given text prompt. However, these models lack the ability to mimic the appearance of subjects in a given reference set and synthesize novel renditions of them in different contexts. In this work, we present a new approach for "personalization" of text-to-image diffusion models. Given as input just a few images of a subject, we fine-tune a pretrained text-to-image model such that it learns to bind a unique identifier with that specific subject. Once the subject is embedded in the output domain of the model, the unique identifier can be used to synthesize novel photorealistic images of the subject contextualized in different scenes. By leveraging the semantic prior embedded in the model with a new autogenous class-specific prior preservation loss, our technique enables synthesizing the subject in diverse scenes, poses, views and lighting conditions that do not appear in the reference images. We apply our technique to several previously-unassailable tasks, including subject recontextualization, text-guided view synthesis, and artistic rendering, all while preserving the subject's key features. We also provide a new dataset and evaluation protocol for this new task of subject-driven generation. Project page: https://dreambooth.github.io/
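As a rough sketch of the objective described above, the snippet below combines the standard denoising loss on the few subject images (captioned with the unique identifier) with the same loss on model-generated class images, which is the intent of the class-specific prior preservation term. Names such as `unet`, `noisy_latents`, and `lambda_prior` are hypothetical placeholders; see the paper and project page for the exact formulation and hyperparameters.

```python
# Hypothetical sketch of DreamBooth-style fine-tuning with prior preservation.
# Subject batch: a few photos of the subject, captioned e.g. "a [V] dog".
# Prior batch:   images sampled from the frozen model for the plain class prompt "a dog".
import torch.nn.functional as F

def dreambooth_loss(unet, subject_batch, prior_batch, lambda_prior=1.0):
    # Each batch carries noisy latents, the timestep, the text conditioning,
    # and the noise target used by the standard diffusion objective.
    subject_pred = unet(subject_batch["noisy_latents"], subject_batch["t"],
                        subject_batch["text_emb"])
    prior_pred = unet(prior_batch["noisy_latents"], prior_batch["t"],
                      prior_batch["text_emb"])

    subject_loss = F.mse_loss(subject_pred, subject_batch["noise"])  # bind identifier to subject
    prior_loss   = F.mse_loss(prior_pred, prior_batch["noise"])      # keep the class prior intact
    return subject_loss + lambda_prior * prior_loss
```

The second term is what discourages the fine-tuned model from collapsing every "dog" onto the specific subject, preserving its ability to render the broader class.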
Controlling High-Dimensional Data With Sparse Input
Iliescu, Dan Andrei, Mohan, Devang Savita Ram, Teh, Tian Huey, Hodari, Zack
We address the problem of human-in-the-loop control for generating highly-structured data. This task is challenging because existing generative models lack an efficient interface through which users can modify the output. Users have the option to either manually explore a non-interpretable latent space, or to laboriously annotate the data with conditioning labels. To solve this, we introduce a novel framework whereby an encoder maps a sparse, human interpretable control space onto the latent space of a generative model. We apply this framework to the task of controlling prosody in text-to-speech synthesis. We propose a model, called Multiple-Instance CVAE (MICVAE), that is specifically designed to encode sparse prosodic features and output complete waveforms. We show empirically that MICVAE displays desirable qualities of a sparse human-in-the-loop control mechanism: efficiency, robustness, and faithfulness. With even a very small number of input values (~4), MICVAE enables users to improve the quality of the output significantly, in terms of listener preference (4:1).
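The control idea can be illustrated with a small, hypothetical sketch: an encoder consumes only the handful of prosodic values the user chose to specify (plus a mask marking which positions were given) and predicts a latent code for a pretrained decoder. This is a simplified stand-in for the authors' multiple-instance encoder, with made-up module names and dimensions.

```python
# Hypothetical sketch of sparse human-in-the-loop control: map ~4 user-specified
# prosody values (plus a mask of which positions were given) to the latent code
# of a pretrained generative decoder, instead of asking the user to label everything.
import torch
import torch.nn as nn

class SparseControlEncoder(nn.Module):
    def __init__(self, n_positions=64, latent_dim=32):
        super().__init__()
        # Input: per-position (value, is_specified) pairs, flattened.
        self.net = nn.Sequential(
            nn.Linear(n_positions * 2, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, values, mask):
        # Zero out unspecified positions so only the sparse user input matters.
        x = torch.cat([values * mask, mask], dim=-1)
        return self.net(x)  # latent code consumed by the frozen generative decoder

encoder = SparseControlEncoder()
values = torch.zeros(1, 64); mask = torch.zeros(1, 64)
values[0, [3, 10, 20, 45]] = torch.tensor([0.8, -0.2, 0.5, 1.1])  # ~4 user edits
mask[0, [3, 10, 20, 45]] = 1.0
z = encoder(values, mask)   # would be decoded into a complete waveform downstream
print(z.shape)
```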
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- (6 more...)
How to Use AI Tools Like ChatGPT in Your Business
Artificial intelligence is not only altering the course of the internet but also shaping the future of business. While some fear that it will have harmful economic repercussions by replacing people in jobs, AI can also serve as a game-changing tool to grow a business and increase its efficiency, helping with everything from lead generation to content creation. Launched by OpenAI in November 2022, ChatGPT amassed more than a million users in just five days. A generative dialogue AI application, it can create new content, and its potential uses are virtually endless -- from writing full essays to blog posts, song lyrics to cover letters and resumes. It can even draft legal contracts using local statutes and regulations pulled from public sources.
Google AI launches YouTube Channel for Free Resources on AI/ML
Google AI announced the launch of the Google Research YouTube channel today. The channel is set to focus on a wide range of subjects such as AI/ML, robotics, theory and algorithms, quantum computing, and health and bioscience. Each episode will introduce viewers to a Google researcher who explains their innovations and the implications of newly emerging technologies for our daily lives. In its first rendition, Drew Calcagno spoke with Google researchers Sharan Narang and Aakanksha Chowdhery, who brought language models to robotics and coded the Pathways Language Model (PaLM). The series focuses on converting research publications by Google researchers into byte-sized content for viewers.