Revealing Multimodal Contrastive Representation Learning through Latent Partial Causal Models