X&Fuse: Fusing Visual Information in Text-to-Image Generation
Kirstain, Yuval; Levy, Omer; Polyak, Adam
arXiv.org Artificial Intelligence
We introduce X&Fuse, a general approach for conditioning on visual information when generating images from text. We demonstrate the potential of X&Fuse in three text-to-image generation scenarios. (i) When a bank of images is available, we retrieve and condition on a related image (Retrieve&Fuse), which yields significant improvements on the MS-COCO benchmark and a state-of-the-art FID score of 6.65 in zero-shot settings. (ii) When cropped-object images are at hand, we use them to perform subject-driven generation (Crop&Fuse), outperforming the textual inversion method while being more than 100x faster. (iii) When oracle access to the image scene is available (Scene&Fuse), we achieve an FID score of 5.03 on MS-COCO in zero-shot settings. Our experiments indicate that X&Fuse is an effective, simple, general, and easy-to-adapt approach for scenarios in which the model may benefit from additional visual information.
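The retrieval step in scenario (i) can be illustrated with a minimal sketch. The abstract does not say how the related image is selected, so everything here is an assumption made for illustration: the prompt and the bank images are represented by toy embedding vectors, similarity is cosine similarity, and the names `cosine_similarity`, `retrieve_related_image`, and `image_bank` are hypothetical. The sketch covers only the lookup, not how the retrieved image is fused into generation.

```python
import math
from typing import Mapping, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_related_image(
    query_embedding: Sequence[float],
    bank: Mapping[str, Sequence[float]],
) -> str:
    """Return the id of the bank image whose embedding is closest to the query.

    Assumption: nearest neighbour under cosine similarity; the source does not
    specify the retrieval criterion.
    """
    if not bank:
        raise ValueError("image bank is empty")
    return max(bank, key=lambda image_id: cosine_similarity(query_embedding, bank[image_id]))


# Toy data (hypothetical): three bank images with 3-dimensional embeddings.
image_bank = {
    "dog.jpg": [0.9, 0.1, 0.0],
    "car.jpg": [0.0, 0.2, 0.9],
    "cat.jpg": [0.7, 0.6, 0.1],
}
prompt_embedding = [1.0, 0.2, 0.0]

# The selected image would then be passed to the generator as extra conditioning.
related = retrieve_related_image(prompt_embedding, image_bank)
print(related)
```

A linear scan like this is enough for a toy bank; a large bank would normally call for an approximate nearest-neighbour index, which is outside what the abstract describes.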
Mar-2-2023
- Country:
- Asia > Middle East > Israel > Tel Aviv District > Tel Aviv (0.04)
- Genre:
- Research Report (0.64)
- Technology: