Aligning What Matters: Masked Latent Adaptation for Text-to-Audio-Video Generation
