text-to-image synthesis
Compositional Image Synthesis with Inference-Time Scaling
Ji, Minsuk, Lee, Sanghyeok, Ahn, Namhyuk
ABSTRACT Despite their impressive realism, modern text-to-image models still struggle with compositionality, often failing to render accurate object counts, attributes, and spatial relations. To address this challenge, we present a training-free framework that combines an object-centric approach with self-refinement to improve layout faithfulness while preserving aesthetic quality. Specifically, we leverage large language models (LLMs) to synthesize explicit layouts from input prompts, and we inject these layouts into the image generation process, where an object-centric vision-language model (VLM) judge iteratively re-ranks multiple candidates to select the most prompt-aligned outcome. By unifying explicit layout grounding with self-refine-based inference-time scaling, our framework achieves stronger scene alignment with prompts compared to recent text-to-image models.
Index Terms: text-to-image synthesis, inference-time scaling, object-centric
1. INTRODUCTION Text-to-image (T2I) diffusion models now deliver striking realism and diversity from textual prompts [1, 2, 3, 4], yet they still struggle with compositionality: the precise rendering of object counts, attributes, and spatial relations [5].
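The loop described in this abstract (propose a layout, generate several candidates, let a judge re-rank them, repeat) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: `refine_with_reranking`, `toy_layout`, `toy_generate`, and `toy_judge` are hypothetical names, the "layout" is reduced to per-object counts, and the "image" is a dictionary of counts rather than pixels. It is not the authors' implementation.

```python
import random
from typing import Callable, Dict, Optional, Tuple

Layout = Dict[str, int]   # assumption: layout reduced to object counts
Image = Dict[str, int]    # assumption: toy "image" is just rendered counts


def refine_with_reranking(
    prompt: str,
    propose_layout: Callable[[str], Layout],
    generate: Callable[[str, Layout, Optional[Image], random.Random], Image],
    judge: Callable[[str, Layout, Image], float],
    n_candidates: int = 4,
    n_rounds: int = 3,
    seed: int = 0,
) -> Tuple[Optional[Image], float]:
    """Best-of-N selection repeated over several refinement rounds."""
    rng = random.Random(seed)
    layout = propose_layout(prompt)          # stands in for the LLM layout step
    best, best_score = None, float("-inf")
    for _ in range(n_rounds):
        # Every candidate in a round starts from the best result so far.
        candidates = [generate(prompt, layout, best, rng) for _ in range(n_candidates)]
        scored = [(judge(prompt, layout, c), c) for c in candidates]
        top_score, top = max(scored, key=lambda pair: pair[0])
        if top_score > best_score:           # keep only strict improvements
            best, best_score = top, top_score
    return best, best_score


def toy_layout(prompt: str) -> Layout:
    # Hypothetical stub: a real system would query an LLM with the prompt.
    return {"cat": 2, "dog": 1}


def toy_generate(prompt, layout, previous, rng) -> Image:
    # Hypothetical stub: randomly perturb the previous best "image".
    base = previous if previous is not None else {label: 0 for label in layout}
    return {label: max(0, base[label] + rng.choice([-1, 0, 1, 2])) for label in layout}


def toy_judge(prompt, layout, image) -> float:
    # Hypothetical stub for the VLM judge: 0 is a perfect count match.
    return -float(sum(abs(image[label] - layout[label]) for label in layout))


best, score = refine_with_reranking("two cats and a dog", toy_layout, toy_generate, toy_judge)
print(best, score)
```

Because the best candidate is only replaced on a strict improvement, the score never decreases as rounds are added, which is the basic appeal of spending more compute at inference time.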
Implicit Inversion turns CLIP into a Decoder
D'Orazio, Antonio, Briglia, Maria Rosaria, Crisostomi, Donato, Loi, Dario, Rodolà, Emanuele, Masi, Iacopo
CLIP is a discriminative model trained to align images and text in a shared embedding space. Due to its multimodal structure, it serves as the backbone of many generative pipelines, where a decoder is trained to map from the shared space back to images. In this work, we show that image synthesis is nevertheless possible using CLIP alone -- without any decoder, training, or fine-tuning. Our approach optimizes a frequency-aware implicit neural representation that encourages coarse-to-fine generation by stratifying frequencies across network layers. To stabilize this inverse mapping, we introduce adversarially robust initialization, a lightweight Orthogonal Procrustes projection to align local text and image embeddings, and a blending loss that anchors outputs to natural image statistics. Without altering CLIP's weights, this framework unlocks capabilities such as text-to-image generation, style transfer, and image reconstruction. These findings suggest that discriminative models may hold untapped generative potential, hidden in plain sight.
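For readers unfamiliar with the Orthogonal Procrustes problem mentioned in this abstract, a minimal NumPy sketch follows. It shows the standard closed-form solution: the orthogonal matrix R minimizing ||A R - B||_F is U V^T, where U S V^T is the SVD of A^T B. The random matrices are placeholders standing in for two sets of embeddings; how the paper pairs text and image embeddings is not stated in the abstract, so that part is an assumption.

```python
import numpy as np


def orthogonal_procrustes(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Return the orthogonal matrix R that minimizes ||A @ R - B||_F."""
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt


# Placeholder data: 50 "embeddings" of dimension 8 (an assumption, not CLIP outputs).
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 8))
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))  # a random orthogonal matrix
B = A @ Q                                     # B is an exact rotation/reflection of A

R = orthogonal_procrustes(A, B)
print(np.allclose(R @ R.T, np.eye(8)))  # R is orthogonal
print(np.allclose(A @ R, B))            # R recovers the hidden transform
```

The projection is lightweight in the sense that it needs one SVD of a d x d matrix and no gradient-based training.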
Token Merging for Training-Free Semantic Binding in Text-to-Image Synthesis
Although text-to-image (T2I) models exhibit remarkable generation capabilities, they frequently fail to accurately bind semantically related objects or attributes in the input prompts; a challenge termed semantic binding. Previous approaches either involve intensive fine-tuning of the entire T2I model or require users or large language models to specify generation layouts, adding complexity. In this paper, we define semantic binding as the task of associating a given object with its attribute, termed attribute binding, or linking it to other related sub-objects, referred to as object binding. We introduce a novel method called Token Merging (ToMe), which enhances semantic binding by aggregating relevant tokens into a single composite token. This ensures that the object, its attributes and sub-objects all share the same cross-attention map.
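The idea of collapsing an object and its attributes into one composite token can be sketched as follows. The abstract does not say how the tokens are aggregated, so the element-wise sum used here is an assumption, and `merge_tokens` is a hypothetical helper rather than the ToMe implementation. Embeddings are plain Python lists to keep the example dependency-free.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def merge_tokens(
    tokens: Sequence[str],
    embeddings: Sequence[Vector],
    group: Sequence[int],
    keep: int,
) -> Tuple[List[str], List[Vector]]:
    """Replace position `keep` with the sum of all embeddings in `group`
    and drop the other grouped positions (aggregation by sum is an assumption)."""
    if keep not in group:
        raise ValueError("`keep` must be one of the grouped positions")
    dim = len(embeddings[keep])
    composite = [sum(embeddings[i][d] for i in group) for d in range(dim)]

    merged_tokens: List[str] = []
    merged_embeddings: List[Vector] = []
    for i, (tok, emb) in enumerate(zip(tokens, embeddings)):
        if i == keep:
            # One composite token now represents the whole group.
            merged_tokens.append("+".join(tokens[j] for j in sorted(group)))
            merged_embeddings.append(composite)
        elif i not in group:
            merged_tokens.append(tok)
            merged_embeddings.append(list(emb))
    return merged_tokens, merged_embeddings


tokens = ["a", "red", "car"]
embeddings = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]  # toy 2-D vectors
merged_tokens, merged_embeddings = merge_tokens(tokens, embeddings, group=[1, 2], keep=2)
print(merged_tokens)
print(merged_embeddings)
```

With a single token standing for "red car", there is only one cross-attention map for the object and its attribute, which is the property the abstract highlights.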
StyleDrop: Text-to-Image Synthesis of Any Style
Pre-trained large text-to-image models synthesize impressive images with an appropriate use of text prompts. However, ambiguities inherent in natural language and out-of-distribution effects make it hard to synthesize arbitrary image styles, leveraging a specific design pattern, texture or material. In this paper, we introduce StyleDrop, a method that enables the synthesis of images that faithfully follow a specific style using a text-to-image model. StyleDrop is extremely versatile and captures nuances and details of a user-provided style, such as color schemes, shading, design patterns, and local and global effects. StyleDrop works by efficiently learning a new style by fine-tuning very few trainable parameters (less than 1% of total model parameters), and improving the quality via iterative training with either human or automated feedback.