Parameter-Efficient Cross-lingual Transfer of Vision and Language Models via Translation-based Alignment