Sequential Latent Spaces for Modeling the Intention During Diverse Image Captioning