On the Language Encoder of Contrastive Cross-modal Models