Reviews: A Neural Compositional Paradigm for Image Captioning

Neural Information Processing Systems 

The paper then shows that using such an explicit factorization in the generation mecahnism can help achieve better diversity in captions, and better generalization with lesser amount of data, and really good results on the SPICE metric which focusses more on the syntactic structure of image cpations.