SAMU-XLSR: Semantically-Aligned Multimodal Utterance-level Cross-Lingual Speech Representation