Cross-Lingual Representation Alignment Through Contrastive Image-Caption Tuning