Cocktail: Mixing Multi-Modality Control for Text-Conditional Image Generation

Dec-25-2025, 19:12:10 GMT–Neural Information Processing Systems

Text-conditional diffusion models can generate high-fidelity images with diverse content. However, text prompts often describe the intended image ambiguously, so additional control signals are needed to make text-guided diffusion models more effective. In this work, we propose Cocktail, a pipeline that mixes various modalities into one embedding and combines a generalized ControlNet (gControlNet), a controllable normalisation (ControlNorm), and a spatial guidance sampling method to achieve multi-modal, spatially refined control for text-conditional diffusion models. Specifically, we introduce gControlNet, a hyper-network dedicated to aligning and infusing control signals from disparate modalities into the pre-trained diffusion model.
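As a rough illustration of the general idea of fusing several control signals into one and using the result to modulate normalized features, here is a minimal toy sketch in plain Python. It is not the paper's gControlNet or ControlNorm, which are learned network components whose details the abstract does not give. The function names (`mix_controls`, `control_modulated_norm`), the fixed weighted sum, and the hand-set `gain` and `bias` parameters are all assumptions made for this example.

```python
from statistics import fmean, pstdev


def mix_controls(controls, weights=None):
    """Fuse per-modality control maps into one map by a weighted sum.

    `controls` maps a modality name (e.g. "edge", "pose") to a list of floats.
    This fixed sum is a stand-in (assumption) for a learned fusion network.
    """
    names = sorted(controls)
    if weights is None:
        weights = {name: 1.0 for name in names}
    length = len(controls[names[0]])
    if any(len(controls[name]) != length for name in names):
        raise ValueError("all control maps must have the same length")
    return [
        sum(weights[name] * controls[name][i] for name in names)
        for i in range(length)
    ]


def control_modulated_norm(features, mixed, gain=1.0, bias=1.0, eps=1e-5):
    """Normalize features, then scale and shift each position by the control.

    Where the mixed control is zero, the output is the plain normalized
    feature. `gain` and `bias` are hand-set here (assumption); in a real
    model such parameters would be learned.
    """
    mean = fmean(features)
    std = pstdev(features)
    out = []
    for x, c in zip(features, mixed):
        x_hat = (x - mean) / (std + eps)
        out.append(x_hat * (1.0 + gain * c) + bias * c)
    return out


if __name__ == "__main__":
    # Two hypothetical modalities flattened to four spatial positions.
    controls = {"edge": [0.0, 1.0, 0.0, 1.0], "pose": [1.0, 0.0, 0.0, 0.0]}
    mixed = mix_controls(controls)
    print(mixed)
    print(control_modulated_norm([1.0, 2.0, 3.0, 4.0], mixed))
```

In this sketch, positions with no control signal pass through as ordinary normalized features, so the control only changes the output where some modality is active.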

control signal, diffusion model, mixing multi-modality control

Country:
- Asia > China > Heilongjiang Province > Daqing (0.07)

Technology:
- Information Technology > Artificial Intelligence > Machine Learning (1.00)