SITTA: A Semantic Image-Text Alignment for Image Captioning