Enhancing Multimodal Emotion Recognition through Multi-Granularity Cross-Modal Alignment