Understanding the Emergence of Multimodal Representation Alignment