Ctrl-X: Controlling Structure and Appearance for Text-To-Image Generation Without Guidance

May-27-2025, 20:18:47 GMT–Neural Information Processing Systems

Recent controllable generation approaches such as FreeControl and Diffusion Self-Guidance bring fine-grained spatial and appearance control to text-to-image (T2I) diffusion models without training auxiliary modules. This work presents Ctrl-X, a simple framework for controlling the structure and appearance of T2I diffusion outputs without additional training or guidance. Ctrl-X combines feed-forward structure control, which aligns the output with a structure image, and semantic-aware appearance transfer, which carries over the appearance of a user-provided image. Extensive qualitative and quantitative experiments show that Ctrl-X performs strongly across a variety of condition inputs and model checkpoints. In particular, Ctrl-X supports novel structure and appearance control with arbitrary condition images of any modality, exhibits superior image quality and appearance transfer compared to existing works, and provides instant plug-and-play functionality to any T2I and text-to-video (T2V) diffusion model.
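The abstract does not spell out how the two components work, so the following is only an illustrative sketch under stated assumptions, not the authors' implementation. It assumes structure control can be approximated by copying intermediate features from the structure image into the output branch, and that appearance transfer can be approximated by attention-weighted feature statistics: each output position attends to the appearance features and adopts their weighted mean and standard deviation. The names `inject_structure` and `appearance_transfer` are hypothetical, and small nested lists stand in for the feature maps a real diffusion model would hold as tensors.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def normalize(feats, eps=1e-5):
    """Normalize each channel to zero mean and unit variance over positions."""
    n, c = len(feats), len(feats[0])
    out = [[0.0] * c for _ in range(n)]
    for ch in range(c):
        column = [row[ch] for row in feats]
        mean = sum(column) / n
        var = sum((v - mean) ** 2 for v in column) / n
        std = math.sqrt(var + eps)
        for i in range(n):
            out[i][ch] = (feats[i][ch] - mean) / std
    return out


def inject_structure(out_feats, struct_feats):
    """Assumed structure control: overwrite output features with structure features."""
    assert len(out_feats) == len(struct_feats)
    return [list(row) for row in struct_feats]


def appearance_transfer(out_feats, app_feats, eps=1e-5):
    """Assumed appearance transfer via attention-weighted mean and std.

    out_feats: N positions x C channels; app_feats: M positions x C channels.
    """
    queries = normalize(out_feats)
    keys = normalize(app_feats)
    c = len(queries[0])
    result = []
    for q in queries:
        # Correspondence between this output position and every appearance position.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(c) for k in keys]
        weights = softmax(scores)
        row = []
        for ch in range(c):
            mean = sum(w * a[ch] for w, a in zip(weights, app_feats))
            second = sum(w * a[ch] ** 2 for w, a in zip(weights, app_feats))
            std = math.sqrt(max(second - mean * mean, 0.0) + eps)
            # Re-scale the normalized output feature with the appearance statistics.
            row.append(std * q[ch] + mean)
        result.append(row)
    return result


structure = [[1.0, 2.0], [3.0, 0.0], [5.0, 4.0]]
appearance = [[0.5, -1.0], [0.5, -1.0]]
output = inject_structure([[0.0, 0.0]] * 3, structure)
styled = appearance_transfer(output, appearance)
```

In this toy run the appearance features are constant, so every output position ends up very close to `[0.5, -1.0]`, while the normalized structure features contribute only a small residual. With non-constant appearance features, the attention weights would decide which appearance regions each output position borrows its statistics from.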

controlling structure and appearance, ctrl-x, text-to-image generation

Technology:
- Information Technology
  - Sensing and Signal Processing > Image Processing (0.74)
  - Artificial Intelligence
    - Vision (0.85)
    - Machine Learning > Neural Networks (0.40)