LOTS of Fashion! Multi-Conditioning for Image Generation via Sketch-Text Pairing

Girella, Federico, Talon, Davide, Liu, Ziyue, Ruan, Zanxi, Wang, Yiming, Cristani, Marco

Sep-5-2025–arXiv.org Artificial Intelligence

Fashion design is a complex creative process that blends visual and textual expressions. Designers convey ideas through sketches, which define spatial structure and design elements, and textual descriptions, capturing material, texture, and stylistic details. In this paper, we present LOcalized Text and Sketch for fashion image generation (LOTS), an approach for compositional sketch-text based generation of complete fashion outlooks. LOTS leverages a global description with paired localized sketch + text information for conditioning and introduces a novel step-based merging strategy for diffusion adaptation. First, a Modularized Pair-Centric representation encodes sketches and text into a shared latent space while preserving independent localized features; then, a Diffusion Pair Guidance phase integrates both local and global conditioning via attention-based guidance within the diffusion model's multi-step denoising process. To validate our method, we build on Fashionpedia to release Sketchy, the first fashion dataset where multiple text-sketch pairs are provided per image. Quantitative results show LOTS achieves state-of-the-art image generation performance on both global and localized metrics, while qualitative examples and a human evaluation study highlight its unprecedented level of design customization.
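To make the idea of attention-based conditioning on one global description plus several localized sketch-text pairs more concrete, here is a toy sketch in plain Python. It is not the LOTS architecture: the averaging in `fuse_pair`, the function names, the 4-dimensional vectors, and the single-query attention are all assumptions chosen only to illustrate how each pair can stay a separate token until attention mixes them.

```python
import math

def fuse_pair(sketch_vec, text_vec):
    """Hypothetical pair encoder: average one sketch and one text embedding.
    (Assumption for illustration; the paper's encoder is not specified here.)"""
    return [(s + t) / 2.0 for s, t in zip(sketch_vec, text_vec)]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(query, tokens):
    """Scaled dot-product attention of a single query over conditioning tokens.
    Returns the attention weights and the weighted sum of the tokens."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, tok)) / scale for tok in tokens]
    weights = softmax(scores)
    output = [
        sum(w * tok[i] for w, tok in zip(weights, tokens))
        for i in range(len(query))
    ]
    return weights, output

# Toy embeddings (made-up numbers): one global description, two local pairs.
global_token = [0.1, 0.0, 0.2, 0.1]
pairs = [
    ([1.0, 0.0, 0.0, 0.0], [0.8, 0.2, 0.0, 0.0]),  # e.g. a jacket sketch + text
    ([0.0, 1.0, 0.0, 0.0], [0.0, 0.6, 0.4, 0.0]),  # e.g. a skirt sketch + text
]

# Each pair is fused on its own, so localized features stay independent.
pair_tokens = [fuse_pair(s, t) for s, t in pairs]
tokens = [global_token] + pair_tokens

# A query standing in for one spatial location of the image latent.
query = [2.0, 0.0, 0.0, 0.0]
weights, conditioned = cross_attend(query, tokens)
```

In this toy setup the query is aligned with the first pair, so that pair receives the largest attention weight, while the global token and the second pair still contribute to the conditioned output.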

artificial intelligence, diffusion model, machine learning

Country:
- Europe (0.93)

Genre:
- Research Report > Promising Solution (0.93)

Technology:
- Information Technology
  - Sensing and Signal Processing > Image Processing (1.00)
  - Artificial Intelligence
    - Vision (1.00)
    - Machine Learning > Neural Networks (1.00)