Compositional Image Synthesis with Inference-Time Scaling

Minsuk Ji, Sanghyeok Lee, Namhyuk Ahn

arXiv.org Artificial Intelligence 

ABSTRACT

Despite their impressive realism, modern text-to-image models still struggle with compositionality, often failing to render accurate object counts, attributes, and spatial relations. To address this challenge, we present a training-free framework that combines an object-centric approach with self-refinement to improve layout faithfulness while preserving aesthetic quality. Specifically, we leverage large language models (LLMs) to synthesize explicit layouts from input prompts and inject these layouts into the image generation process, where an object-centric vision-language model (VLM) judge iteratively re-ranks multiple candidates to select the most prompt-aligned outcome. By unifying explicit layout grounding with self-refinement-based inference-time scaling, our framework achieves stronger scene alignment with prompts than recent text-to-image models.

Index Terms-- text-to-image synthesis, inference-time scaling, object-centric

1. INTRODUCTION

Text-to-image (T2I) diffusion models now deliver striking realism and diversity from textual prompts [1, 2, 3, 4], yet they still struggle with compositionality: the precise rendering of object counts, attributes, and spatial relations [5].
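
To make the pipeline sketched in the abstract concrete, the following is a minimal Python sketch of the re-rank-and-refine loop under our own assumptions; the helpers generate_layout, generate_image, score_alignment, and refine_layout are hypothetical placeholders standing in for an LLM layout planner, a layout-conditioned T2I model, and a VLM judge, and are not part of the method as released.

# Minimal sketch (assumed interface, not the authors' implementation):
# iterate rounds of layout-grounded generation, score candidates with a
# VLM judge, keep the best image, and feed the winner back to refine the layout.
def synthesize(prompt, num_candidates=4, num_rounds=3):
    layout = generate_layout(prompt)              # LLM: prompt -> explicit object layout (hypothetical)
    best_image, best_score = None, float("-inf")
    for _ in range(num_rounds):                   # inference-time scaling: repeat and refine
        candidates = [generate_image(prompt, layout)            # layout-grounded T2I (hypothetical)
                      for _ in range(num_candidates)]
        scores = [score_alignment(prompt, layout, img)          # object-centric VLM judge (hypothetical)
                  for img in candidates]
        top = max(range(num_candidates), key=lambda i: scores[i])
        if scores[top] > best_score:              # keep the most prompt-aligned candidate so far
            best_image, best_score = candidates[top], scores[top]
        layout = refine_layout(prompt, layout, candidates[top]) # self-refinement step (hypothetical)
    return best_image

In this reading, the number of candidates and rounds are the knobs that trade additional inference compute for stronger prompt alignment.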