PostEdit: Posterior Sampling for Efficient Zero-Shot Image Editing

Feng Tian, Yixuan Li, Yichao Yan, Shanyan Guan, Yanhao Ge, Xiaokang Yang

arXiv.org Artificial Intelligence 

Large text-to-image diffusion models Saharia et al. (2022); Pernias et al. (2024); Podell et al. (2024); Ramesh et al. (2022) have demonstrated significant capabilities in generating photorealistic images from textual prompts, facilitating both the creation and editing of real images. Current research Cao et al. (2023); Brack et al. (2024); Ju et al. (2024); Parmar et al. (2023); Wu & De la Torre (2022); Xu et al. (2024) highlights three main challenges in image editing: controllability, background preservation, and efficiency. Specifically, the edited regions must align with the concepts in the target prompt, while unedited regions should remain unchanged. Additionally, the editing process must be efficient enough to support interactive use.

There are two mainstream categories of image editing approaches, namely inversion-based and inversion-free methods, as illustrated in Figure 1. Inversion-based approaches Song et al. (2021a); Mokady et al. (2023); Wu & De la Torre (2022); Huberman-Spiegelglas et al. (2024) progressively add noise to a clean image and then remove the noise conditioned on a given target prompt, using a large text-to-image diffusion model (e.g., Stable Diffusion Rombach et al. (2022)) to obtain the edited image. However, directly inverting the diffusion sampling process (e.g., DDIM Song et al. (2021a)) for reconstruction introduces bias relative to the initial image, owing to errors accumulated by the unconditional score term, as discussed in classifier-free guidance (CFG) Ho & Salimans (2022) and proven in App.
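To make the inversion-then-denoising pipeline concrete, the sketch below shows a deterministic DDIM update, its standard approximate inversion, and CFG mixing. This is a minimal illustration, not the paper's method: the noise-prediction network eps_model (approximating eps_theta(x_t, t, cond)) and the cumulative schedule values a_t = \bar{alpha}_t are hypothetical placeholders.

import torch

def cfg_eps(eps_model, x_t, t, cond, uncond, guidance_scale=7.5):
    # Classifier-free guidance (Ho & Salimans, 2022):
    # eps = eps_uncond + w * (eps_cond - eps_uncond)
    eps_c = eps_model(x_t, t, cond)
    eps_u = eps_model(x_t, t, uncond)
    return eps_u + guidance_scale * (eps_c - eps_u)

def ddim_step(x_t, eps, a_t, a_prev):
    # Deterministic DDIM update x_t -> x_{t-1} with eta = 0 (Song et al., 2021a).
    x0_pred = (x_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()  # predicted clean image
    return a_prev.sqrt() * x0_pred + (1 - a_prev).sqrt() * eps

def ddim_invert_step(x_t, eps, a_t, a_next):
    # Approximate inversion x_t -> x_{t+1}: reuses the noise predicted at step t,
    # i.e., assumes eps(x_{t+1}) ~ eps(x_t). This linearization is the per-step
    # error that accumulates over the trajectory.
    x0_pred = (x_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()
    return a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps

In use, an image latent is pushed forward with ddim_invert_step for T steps and then denoised back with ddim_step under the target prompt; when the denoising pass applies cfg_eps with a large guidance scale, the guided noise no longer matches the noise used during inversion, which is one way the reconstruction bias discussed above manifests.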