Zigzag Diffusion Sampling: Diffusion Models Can Self-Improve via Self-Reflection

Bai, Lichen, Shao, Shitong, Zhou, Zikai, Qi, Zipeng, Xu, Zhiqiang, Xiong, Haoyi, Xie, Zeke

Dec-17-2024, arXiv.org Artificial Intelligence

Figure 1: The qualitative results of Z-Sampling demonstrate the effectiveness of our method in various aspects, such as style, position, color, counting, text rendering, and object co-occurrence.

Diffusion models, the most popular generative paradigm so far, can inject conditional information into the generation path to guide the latent towards desired directions. However, existing text-to-image diffusion models often fail to maintain both high image quality and high prompt-image alignment for challenging prompts. To mitigate this issue and enhance existing pretrained diffusion models, we make three main contributions in this paper. First, we propose diffusion self-reflection, which alternately performs denoising and inversion, and we demonstrate with theoretical and empirical evidence that it can leverage the guidance gap between denoising and inversion to capture prompt-related semantic information. Second, motivated by this theoretical analysis, we derive Zigzag Diffusion Sampling (Z-Sampling), a novel self-reflection-based diffusion sampling method that leverages the guidance gap between denoising and inversion to accumulate semantic information step by step along the sampling path, leading to improved sampling results. Moreover, as a plug-and-play method, Z-Sampling can be applied to various diffusion models (e.g., accelerated ones and Transformer-based ones) with very limited coding and computational costs. Third, our extensive experiments demonstrate that Z-Sampling can generally and significantly enhance generation quality across various benchmark datasets, diffusion models, and performance evaluation metrics. In addition, Z-Sampling can further enhance existing diffusion models when combined with other orthogonal methods, including Diffusion-DPO.
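The alternation of denoising and inversion described above can be illustrated with a small sketch. This is not the authors' implementation: it assumes classifier-free guidance for both directions, a deterministic DDIM-style update, and a zigzag order of "denoise, invert, denoise again" at every step. The names `guided_eps`, `ddim_move`, `zigzag_sample`, and `eps_model`, as well as the default guidance scales, are hypothetical placeholders chosen for illustration.

```python
import numpy as np

def guided_eps(eps_model, x, t, cond, scale):
    """Classifier-free guidance: blend unconditional and conditional noise predictions."""
    e_uncond = eps_model(x, t, None)
    e_cond = eps_model(x, t, cond)
    return e_uncond + scale * (e_cond - e_uncond)

def ddim_move(x, eps, a_from, a_to):
    """Deterministic DDIM-style update between two cumulative-alpha levels.

    Works in both directions: a_to > a_from denoises, a_to < a_from inverts.
    """
    x0_pred = (x - np.sqrt(1.0 - a_from) * eps) / np.sqrt(a_from)
    return np.sqrt(a_to) * x0_pred + np.sqrt(1.0 - a_to) * eps

def zigzag_sample(eps_model, x_T, alphas_bar, cond, g_denoise=5.0, g_invert=1.0):
    """Toy zigzag loop (assumed schedule, placeholder guidance scales).

    alphas_bar[0] is the noisiest level and alphas_bar[-1] the cleanest.
    The gap g_denoise - g_invert plays the role of the guidance gap.
    """
    x = x_T
    for i in range(len(alphas_bar) - 1):
        a_t, a_prev = alphas_bar[i], alphas_bar[i + 1]
        # 1) Denoise one step with strong guidance.
        x_prev = ddim_move(x, guided_eps(eps_model, x, i, cond, g_denoise), a_t, a_prev)
        # 2) Invert back to the noisier level with weak guidance
        #    (approximation: the noise is predicted at the cleaner latent, same step index).
        x_back = ddim_move(x_prev, guided_eps(eps_model, x_prev, i, cond, g_invert), a_prev, a_t)
        # 3) Denoise again from the re-noised latent with strong guidance.
        x = ddim_move(x_back, guided_eps(eps_model, x_back, i, cond, g_denoise), a_t, a_prev)
    return x
```

In this sketch `eps_model(x, t, cond)` stands in for a pretrained noise-prediction network, with `cond=None` meaning the unconditional branch. When the two guidance scales are equal and the noise prediction does not depend on the latent, the inversion exactly undoes the first denoising step and the loop reduces to ordinary sampling; any extra effect comes from the gap between the two scales.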
One key ability of diffusion models is to guide the sampling path based on some conditions (e.g., texts), leading to conditional or controllable generation (Ho & Salimans, 2022). However, while strong guidance may improve semantic alignment for challenging prompts, it often causes a significant decline in image fidelity, leading to mode collapse and an inevitable accumulation of errors during the sampling process (Chung et al., 2024). To mitigate this issue, some studies apply additional manifold constraints to the sampling paths (Chung et al., 2024; Yang et al.;
