CMMD: Contrastive Multi-Modal Diffusion for Video-Audio Conditional Modeling