Diverse Image Captioning with Context Object Split Latent Spaces

Neural Information Processing Systems

The word embedding dimension is 300. In Tab. 7 we further evaluate the diversity of COS-CVAE using self-CIDEr. We provide additional qualitative results in Tabs. In Tab. 12 we show the diverse captions for novel objects generated by our model and the regions. The evaluation server for nocaps accepts only one caption per image and thus does not support methods that model one-to-many relationships between images and captions. In Figure 1 (left) we show the accuracy and diversity scores averaged across annotators; in Figure 1 (right) we show the accuracy and diversity scores from each annotator. We find that the captions generated by COS-CVAE are scored as more accurate than those of COS-CVAE (paired).
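Self-CIDEr measures diversity from the spectrum of a pairwise caption-similarity kernel: if all sampled captions are near-duplicates, the kernel is close to rank one. The sketch below illustrates that idea only; it uses plain unigram cosine similarity as a crude stand-in for CIDEr, and the `diversity` score (one minus the top singular value's share of the spectrum) is a simplified hypothetical variant, not the exact self-CIDEr formula used in Tab. 7.

```python
import numpy as np
from collections import Counter

def similarity(a: str, b: str) -> float:
    """Unigram cosine similarity -- a crude stand-in for a CIDEr kernel."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in set(ca) & set(cb))
    na = np.sqrt(sum(v * v for v in ca.values()))
    nb = np.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def diversity(captions: list[str]) -> float:
    """Spectral diversity of a caption set (simplified, not exact self-CIDEr).

    Builds the pairwise similarity kernel, takes its singular values, and
    returns 1 - s_max / sum(s): 0 when all captions are identical
    (rank-one kernel), approaching 1 - 1/n when they share no words.
    """
    K = np.array([[similarity(a, b) for b in captions] for a in captions])
    s = np.linalg.svd(K, compute_uv=False)
    return 1.0 - s[0] / s.sum()
```

For example, three identical captions give a diversity of 0, while three captions with no words in common give 1 - 1/3.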



We thank all the reviewers for their helpful comments and for recognizing the novelty of our approach (R2-4) and its

Neural Information Processing Systems

We are glad that the reviewers found our experimental setup exhaustive (R1-4). This is not feasible with prior work. We will clarify and highlight these challenges in the final version. Random samples in Tab. 2, 11, and 12 show that the captions from COS-CVAE are coherent. COS-CVAE has a score of 0.742 while Seq-CVAE (attn) has 0.714.