Enhance the Robustness of Text-Centric Multimodal Alignments