Understanding and Constructing Latent Modality Structures in Multi-modal Representation Learning