On the Importance of Contrastive Loss in Multimodal Learning