Enabling Robots to Draw and Tell: Towards Visually Grounded Multimodal Description Generation