clip interrogator
Reinforcement Learning-Based Prompt Template Stealing for Text-to-Image Models
Multimodal Large Language Models (MLLMs) have transformed text-to-image workflows, allowing designers to create novel visual concepts with unprecedented speed. This progress has given rise to a thriving prompt trading market, where curated prompts that induce trademark styles are bought and sold. Although commercially attractive, prompt trading also introduces a largely unexamined security risk: the prompts themselves can be stolen. In this paper, we expose this vulnerability and present RLStealer, a reinforcement-learning-based prompt inversion framework that recovers a prompt template from only a small set of example images. RLStealer treats template stealing as a sequential decision-making problem and employs multiple similarity-based feedback signals as reward functions to effectively explore the prompt space. Comprehensive experiments on publicly available benchmarks demonstrate that RLStealer achieves state-of-the-art performance while reducing the total attack cost to under 13% of that required by existing baselines. Our further analysis confirms that RLStealer can effectively generalize across different image styles to efficiently steal unseen prompt templates. Our study highlights an urgent security threat inherent in prompt trading and lays the groundwork for developing protective standards in the emerging MLLM marketplace.
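The abstract above mentions using multiple similarity-based feedback signals as rewards but does not say which signals or how they are combined. As a rough illustration only, one way several similarity scores could be folded into a single scalar reward is a weighted average of cosine similarities. The signal names, the weights, and the `combined_reward` helper below are all hypothetical and are not taken from RLStealer.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def combined_reward(signals, weights):
    """Weighted average of per-signal cosine similarities.

    signals: maps a signal name to a (candidate_embedding, target_embedding) pair.
    weights: maps the same names to non-negative weights (hypothetical values).
    """
    total_weight = sum(weights[name] for name in signals)
    weighted = sum(weights[name] * cosine(a, b) for name, (a, b) in signals.items())
    return weighted / total_weight

# Toy embeddings: the "image" signal disagrees completely, the "text" signal agrees.
signals = {
    "image": ([1.0, 0.0], [0.0, 1.0]),
    "text": ([0.5, 0.5], [1.0, 1.0]),
}
weights = {"image": 3.0, "text": 1.0}
reward = combined_reward(signals, weights)
```

In a reinforcement learning loop, a scalar like `reward` would be returned after each candidate template is evaluated; the embeddings here are hand-written placeholders rather than the output of any real encoder.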
Visually Guided Decoding: Gradient-Free Hard Prompt Inversion with Language Models
Kim, Donghoon, Bae, Minji, Shim, Kyuhong, Shim, Byonghyo
Text-to-image generative models like DALL-E and Stable Diffusion have revolutionized visual content creation across various applications, including advertising, personalized media, and design prototyping. However, crafting effective textual prompts to guide these models remains challenging, often requiring extensive trial and error. Existing prompt inversion approaches, such as soft and hard prompt techniques, are not so effective due to the limited interpretability and incoherent prompt generation. To address these issues, we propose Visually Guided Decoding (VGD), a gradient-free approach that leverages large language models (LLMs) and CLIP-based guidance to generate coherent and semantically aligned prompts. In essence, VGD utilizes the robust text generation capabilities of LLMs to produce human-readable prompts. Further, by employing CLIP scores to ensure alignment with user-specified visual concepts, VGD enhances the interpretability, generalization, and flexibility of prompt generation without the need for additional training. Our experiments demonstrate that VGD outperforms existing prompt inversion techniques in generating understandable and contextually relevant prompts, facilitating more intuitive and controllable interactions with text-to-image models.
Figure 1: Visually Guided Decoding (VGD) works with any LLM without extra training, making it easy to integrate into a chat-based interface that offers interpretable and controllable text-to-image generation.
In recent years, image generative models such as DALL-E and Stable Diffusion have shown remarkable success in generating high-fidelity images (Ramesh et al., 2022; Rombach et al., 2022; Podell et al., 2024). These models are widely used in a variety of applications, including visual content generation (e.g., advertisement, movie, game), personalized content generation (e.g., caricature, photo editing), and prototyping (e.g., architecture and product design).
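The VGD abstract above describes steering LLM text generation with CLIP scores, without gradients or extra training. The toy sketch below illustrates that general idea under strong assumptions: a greedy loop picks each next token by adding a language-model log-probability to a weighted image-text score. The bigram table stands in for an LLM, the word-overlap function stands in for CLIP, and `alpha` is a hypothetical weighting knob. None of this is the paper's actual objective or code.

```python
import math

# Hypothetical stand-in for an LLM: next-token probabilities given the last token.
BIGRAMS = {
    "<bos>": {"a": 0.6, "watercolor": 0.4},
    "a": {"dog": 0.5, "cat": 0.3, "watercolor": 0.2},
    "watercolor": {"cat": 0.3, "dog": 0.5, "<eos>": 0.2},
    "cat": {"<eos>": 1.0},
    "dog": {"<eos>": 1.0},
}

def toy_image_text_score(tokens, image_concepts):
    """Stand-in for a CLIP score: fraction of the image's concepts named in the text."""
    return len(set(tokens) & image_concepts) / len(image_concepts)

def guided_decode(image_concepts, alpha, max_len=8):
    """Greedy decoding where each candidate token is scored by
    log p_LM(token | last token) + alpha * image-text score of the extended prompt."""
    prefix, last = [], "<bos>"
    for _ in range(max_len):
        best_token, best_score = None, -math.inf
        for token, prob in BIGRAMS[last].items():
            score = math.log(prob) + alpha * toy_image_text_score(prefix + [token], image_concepts)
            if score > best_score:
                best_token, best_score = token, score
        if best_token == "<eos>":
            break
        prefix.append(best_token)
        last = best_token
    return " ".join(prefix)

target = {"watercolor", "cat"}              # concepts assumed present in the target image
guided = guided_decode(target, alpha=2.0)   # visual guidance switched on
unguided = guided_decode(target, alpha=0.0) # language model alone
```

With `alpha=0.0` the loop follows the language model's most likely continuation regardless of the image; raising `alpha` pulls the output toward tokens that match the image, which is the trade-off between fluency and visual alignment that the abstract points to.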
Hidden traces of humanity: what AI images reveal about our world
When faced with a bit of downtime, many of my friends will turn to the same party game. It's based on the surrealist game Exquisite Corpse, and involves translating brief written descriptions into rapidly made drawings and back again. One group calls it Telephone Pictionary; another refers to it as Writey-Drawey. The internet tells me it is also called Eat Poop You Cat, a sequence of words surely inspired by one of the game's results. As recently as three years ago, it was rare to encounter text-to-image or image-to-text mistranslations in daily life, which made the outrageous outcomes of the game feel especially novel. But we have since entered a new era of image-making. With the aid of AI image generators like Dall-E 3, Stable Diffusion and Midjourney, and the generative features integrated into Adobe's Creative Cloud programs, you can now transform a sentence or phrase into a highly detailed image in mere seconds. Images, likewise, can be nearly instantly translated into descriptive text.
Promptify: Text-to-Image Generation through Interactive Prompt Exploration with Large Language Models
Brade, Stephen, Wang, Bryan, Sousa, Mauricio, Oore, Sageev, Grossman, Tovi
Text-to-image generative models have demonstrated remarkable capabilities in generating high-quality images based on textual prompts. However, crafting prompts that accurately capture the user's creative intent remains challenging. It often involves laborious trial-and-error procedures to ensure that the model interprets the prompts in alignment with the user's intention. To address the challenges, we present Promptify, an interactive system that supports prompt exploration and refinement for text-to-image generative models. Promptify utilizes a suggestion engine powered by large language models to help users quickly explore and craft diverse prompts. Our interface allows users to organize the generated images flexibly, and based on their preferences, Promptify suggests potential changes to the original prompt. This feedback loop enables users to iteratively refine their prompts and enhance desired features while avoiding unwanted ones. Our user study shows that Promptify effectively facilitates the text-to-image workflow and outperforms an existing baseline tool widely used for text-to-image generation.
Prompt Stealing Attacks Against Text-to-Image Generation Models
Shen, Xinyue, Qu, Yiting, Backes, Michael, Zhang, Yang
Text-to-Image generation models have revolutionized the artwork design process and enabled anyone to create high-quality images by entering text descriptions called prompts. Creating a high-quality prompt that consists of a subject and several modifiers can be time-consuming and costly. In consequence, a trend of trading high-quality prompts on specialized marketplaces has emerged. In this paper, we propose a novel attack, namely prompt stealing attack, which aims to steal prompts from images generated by text-to-image generation models. Successful prompt stealing attacks directly violate the intellectual property and privacy of prompt engineers and also jeopardize the business model of prompt trading marketplaces. We first perform a large-scale analysis on a dataset collected by ourselves and show that a successful prompt stealing attack should consider a prompt's subject as well as its modifiers. We then propose the first learning-based prompt stealing attack, PromptStealer, and demonstrate its superiority over two baseline methods quantitatively and qualitatively. We also make some initial attempts to defend against PromptStealer. In general, our study uncovers a new attack surface in the ecosystem created by the popular text-to-image generation models. We hope our results can help to mitigate the threat. To facilitate research in this field, we will share our dataset and code with the community.
This New AI Tool Feeds on Hurting the Egos of its Users
There is a trending new AI tool on the block, but instead of creating images, this AI tool analyzes them and spits out crude roasts of anyone they depict. Gone are the days when netizens leveraged DALL·E 2 and racist image-spewing DALL·E mini to generate silly art for Twitter shit posting. The new AI tool, known as the CLIP Interrogator and created by a generative artist who goes by the handle pharmapsychotic, is technically an artificial-intelligence-powered tool to discover "what a good prompt might be to generate new AI art like an existing one." In reality, CLIP Interrogator tends to spit out descriptions of people that can be mundane, puzzling, staggeringly insensitive, and, at times, admittedly a bit hilarious. The new AI tool told one user she looked "tired and drunk," for instance, and accused another user of having a "deformed face."