Attention-based transformer models for image captioning across languages: An in-depth survey and evaluation