An Intermediate Fusion ViT Enables Efficient Text-Image Alignment in Diffusion Models