 text-guided image manipulation


Lightweight Generative Adversarial Networks for Text-Guided Image Manipulation

Neural Information Processing Systems

We propose a novel lightweight generative adversarial network for efficient image manipulation using natural language descriptions. To achieve this, we introduce a new word-level discriminator, which provides the generator with fine-grained training feedback at the word level. This feedback makes it possible to train a lightweight generator that has a small number of parameters but can still correctly focus on specific visual attributes of an image and edit them without affecting other contents that are not described in the text. Furthermore, thanks to the explicit training signal related to each word, the discriminator can also be simplified to have a lightweight structure. Compared with the state of the art, our method has a much smaller number of parameters but still achieves competitive manipulation performance. Extensive experimental results demonstrate that our method can better disentangle different visual attributes, correctly map them to the corresponding semantic words, and thus achieve more accurate image modification using natural language descriptions.
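The idea of word-level feedback in the abstract above can be illustrated with a minimal sketch. It is not the paper's architecture: the function names, the attention-then-cosine scoring, and the `gamma` scaling factor are all assumptions made for illustration, and the embeddings are plain Python lists standing in for the outputs of a text encoder and an image encoder. The point is only that the discriminator emits one score per word rather than one score per sentence, so the loss can tell the generator which word was not reflected in the image.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b, eps=1e-8):
    """Cosine similarity, with eps guarding against zero-length vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)) + eps)


def word_level_scores(word_embs, region_feats, gamma=5.0):
    """Return one score in (0, 1) per word (hypothetical scoring rule).

    word_embs:    list of T word embeddings, each of length D
    region_feats: list of N image-region features, each of length D
    gamma:        assumed sharpening factor applied before the sigmoid
    """
    scores = []
    for w in word_embs:
        # Attention of this word over all image regions (softmax of dot products).
        sims = [dot(w, r) for r in region_feats]
        m = max(sims)
        exps = [math.exp(s - m) for s in sims]
        z = sum(exps)
        attn = [e / z for e in exps]
        # Word-specific visual context: attention-weighted sum of region features.
        context = [
            sum(a * r[d] for a, r in zip(attn, region_feats))
            for d in range(len(w))
        ]
        # Squash the word/context agreement into a per-word probability.
        scores.append(1.0 / (1.0 + math.exp(-gamma * cosine(w, context))))
    return scores


def word_level_bce(scores, labels, eps=1e-8):
    """Binary cross-entropy averaged over words, so each word contributes its own term."""
    terms = [
        -(y * math.log(s + eps) + (1 - y) * math.log(1 - s + eps))
        for s, y in zip(scores, labels)
    ]
    return sum(terms) / len(terms)


# Toy example: two words, and an image whose regions only express the first one.
words = [[1.0, 0.0], [0.0, 1.0]]
regions = [[1.0, 0.0], [0.9, 0.1]]
scores = word_level_scores(words, regions)
loss = word_level_bce(scores, [1, 1])
```

In the toy example the first word receives a higher score than the second, because the region features align with it; a sentence-level discriminator would collapse both into a single number.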


Review for NeurIPS paper: Lightweight Generative Adversarial Networks for Text-Guided Image Manipulation

Neural Information Processing Systems

Weaknesses: - The technical novelty of the proposed method is somewhat incremental, since it is largely based on the work from [14] with some modifications to the generator and discriminator architectures. The word-level training feedback in the discriminator seems to be the main technical contribution, but it is not ground-breaking, as it extends the auxiliary classifier in conditional GAN with multiple classes (i.e. Specifically, only the nouns and adjectives are chosen manually as text-relevant attributes, which convey a very limited context of general descriptions. Although this may allow fine control of the image content in a limited context, it reduces the capability of aligning the rich context of the text to the image, which is often available in approaches that learn to encode the whole sentence (e.g. Although the authors offered some justification in Section 3.2.1 for using a heuristic approach, it does not feel that this assumption holds in general. Current comparisons are mostly focused on ManiGAN.


Review for NeurIPS paper: Lightweight Generative Adversarial Networks for Text-Guided Image Manipulation

Neural Information Processing Systems

The paper proposes a novel text-guided image manipulation method built on a word-level discriminator loss. The proposed method is faster and requires less memory than existing models, and the experimental results show improvements over the baseline method (ManiGAN). The paper initially received mixed ratings, but the concerns were addressed by the rebuttal and all reviewers converged in favor of acceptance. The authors should revise the paper to reflect the reviewers' suggestions, as promised in the rebuttal. NOTE FROM PROGRAM CHAIRS: For the camera-ready version, please expand your broader impact statement to discuss the potential negative impacts of your work, such as forgery and deepfakes, as well as possible mitigations.


DiffusionCLIP: Text-guided Image Manipulation Using Diffusion Models

arXiv.org Artificial Intelligence

Diffusion models are recent generative models that have shown great success in image generation with state-of-the-art performance. However, only a few studies have addressed image manipulation with diffusion models. Here, we present a novel method, DiffusionCLIP, which performs text-driven image manipulation with diffusion models using a Contrastive Language-Image Pre-training (CLIP) loss. Our method performs comparably to modern GAN-based image processing methods on in-domain and out-of-domain image processing tasks, with the advantage of almost perfect inversion even without additional encoders or optimization. Furthermore, our method can be easily used for various novel applications, such as image translation from one unseen domain to another or stroke-conditioned image generation in an unseen domain. Finally, we present a novel multiple-attribute control with DiffusionCLIP by combining multiple fine-tuned diffusion models.
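As a rough illustration of what a CLIP-based loss can look like, the sketch below shows two formulations that are common in CLIP-guided editing: a global loss that compares the edited image directly with the target text, and a directional loss that compares the change in image embedding with the change in text embedding. The abstract does not say which formulation DiffusionCLIP uses, so both are shown as assumptions. The function names are hypothetical, and the embeddings are plain Python lists standing in for the outputs of CLIP's image and text encoders.

```python
import math


def cosine(a, b, eps=1e-8):
    """Cosine similarity, with eps guarding against zero-length vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / (den + eps)


def global_clip_loss(edited_img_emb, target_txt_emb):
    """1 - cos(image, text): small when the edited image matches the target text."""
    return 1.0 - cosine(edited_img_emb, target_txt_emb)


def directional_clip_loss(src_img_emb, edited_img_emb, src_txt_emb, target_txt_emb):
    """1 - cos(delta_image, delta_text): small when the image moved in the
    same embedding direction as the text prompt did."""
    d_img = [e - s for e, s in zip(edited_img_emb, src_img_emb)]
    d_txt = [t - s for t, s in zip(target_txt_emb, src_txt_emb)]
    return 1.0 - cosine(d_img, d_txt)


# Toy embeddings (assumed values): the text changes along the second axis.
src_txt, tgt_txt = [0.5, 0.0], [0.5, 2.0]
src_img = [1.0, 0.0]

aligned = directional_clip_loss(src_img, [1.0, 1.0], src_txt, tgt_txt)
opposed = directional_clip_loss(src_img, [1.0, -1.0], src_txt, tgt_txt)
```

Here `aligned` is close to 0 because the image embedding moved in the same direction as the text embedding, while `opposed` is close to 2 because it moved the opposite way. In a real pipeline this scalar would be backpropagated into the generative model's parameters.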