Evaluation of Multilingual Image Captioning: How far can we get with CLIP models?