An Intermediate Fusion ViT Enables Efficient Text-Image Alignment in Diffusion Models
Zizhao Hu, Shaochong Jia, Mohammad Rostami
arXiv.org Artificial Intelligence
Diffusion models are widely used for conditional cross-modal generation tasks such as text-to-image and text-to-video. However, state-of-the-art models still fail to align the generated visual concepts with high-level semantics in language, such as object count and spatial relationships. We approach this problem from a multimodal data fusion perspective and investigate how different fusion strategies affect vision-language alignment. We find that, compared to the widely used early fusion of conditioning text in a pretrained image feature space, a specially designed intermediate fusion can (i) boost text-to-image alignment with improved generation quality and (ii) improve training and inference efficiency by reducing low-rank text-to-image attention computation. We perform experiments on a text-to-image generation task using the MS-COCO dataset, comparing our intermediate fusion mechanism with the classic early fusion mechanism under two common conditioning methods on a U-shaped ViT (U-ViT) backbone. Our intermediate fusion model achieves a higher CLIP Score and lower FID, with 20% fewer FLOPs and 50% faster training, than a strong U-ViT baseline with early fusion.
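The abstract contrasts two ways of injecting conditioning text into a ViT-based diffusion backbone. The following minimal PyTorch sketch illustrates that contrast; the module names, layer counts, and the use of cross-attention at a band of middle blocks are illustrative assumptions, not the paper's exact U-ViT configuration (which also uses long skip connections omitted here).

```python
# Sketch of early vs. intermediate fusion for text conditioning in a ViT.
# All names, dimensions, and fusion placements are illustrative assumptions.
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Standard pre-norm self-attention block over a token sequence."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                                 nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        h = self.norm(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(x)

class EarlyFusionViT(nn.Module):
    """Early fusion: text tokens are concatenated with image tokens at the
    input, so every block attends over the joint (text + image) sequence."""
    def __init__(self, dim: int, depth: int = 6):
        super().__init__()
        self.blocks = nn.ModuleList([TransformerBlock(dim) for _ in range(depth)])

    def forward(self, img_tokens, txt_tokens):
        x = torch.cat([txt_tokens, img_tokens], dim=1)
        for blk in self.blocks:
            x = blk(x)
        return x[:, txt_tokens.shape[1]:]  # keep only image positions

class IntermediateFusionViT(nn.Module):
    """Intermediate fusion: blocks run on image tokens only; text enters via
    cross-attention at a few middle blocks, so the remaining layers skip the
    extra text-image attention cost entirely."""
    def __init__(self, dim: int, depth: int = 6, fuse_at=(2, 3)):
        super().__init__()
        self.blocks = nn.ModuleList([TransformerBlock(dim) for _ in range(depth)])
        self.fuse_at = set(fuse_at)
        self.cross = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.cross_norm = nn.LayerNorm(dim)

    def forward(self, img_tokens, txt_tokens):
        x = img_tokens
        for i, blk in enumerate(self.blocks):
            x = blk(x)
            if i in self.fuse_at:  # inject text only at the middle layers
                h = self.cross_norm(x)
                x = x + self.cross(h, txt_tokens, txt_tokens,
                                   need_weights=False)[0]
        return x

# Toy shapes: 256 image tokens, 77 text tokens, width 128.
img = torch.randn(2, 256, 128)
txt = torch.randn(2, 77, 128)
print(EarlyFusionViT(128)(img, txt).shape)         # torch.Size([2, 256, 128])
print(IntermediateFusionViT(128)(img, txt).shape)  # torch.Size([2, 256, 128])
```

Under these assumptions the efficiency claim is easy to see: early fusion pays self-attention cost on (L_img + L_txt) tokens at every layer, whereas intermediate fusion confines text-image interaction to a few cross-attention calls of cost proportional to L_img x L_txt, which is consistent with the reported FLOPs and training-speed gains.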
Mar-25-2024