Structures Meet Semantics: Multimodal Fusion via Graph Contrastive Learning