An Intermediate Fusion ViT Enables Efficient Text-Image Alignment in Diffusion Models
