Review for NeurIPS paper: Diverse Image Captioning with Context-Object Split Latent Spaces

Neural Information Processing Systems 

Weaknesses: My main issues are with some of the evaluations in the paper: 1. Oracle accuracy is a bit of a cheat, as it scores all proposed sentences and selects the top-scoring one. I notice that consensus re-ranking is also reported in the supplemental. The results there are good in comparison to prior work, so I am not sure why they are not mentioned in the paper (perhaps they could be squeezed in by rearranging Table 3). However, even the consensus-based ranking is a bit odd, since it relies on finding nearest-neighbor training images. How are the nearest neighbors found? Won't stronger networks do a better job?
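To make the distinction concrete, here is a minimal sketch of the two selection schemes as I understand them. It is not the paper's evaluation code. The unigram-overlap score is a toy stand-in for a real captioning metric, all function names and captions are hypothetical, and the nearest-neighbor captions are assumed to be given (the retrieval step, which is exactly what I am asking about, is not modeled).

```python
from collections import Counter


def overlap_score(candidate, reference):
    """Toy unigram F1 between two captions (assumed stand-in for a real metric)."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    common = sum((cand_counts & ref_counts).values())
    if common == 0:
        return 0.0
    precision = common / sum(cand_counts.values())
    recall = common / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


def mean_score(candidate, references):
    """Average score of one candidate against a set of reference captions."""
    return sum(overlap_score(candidate, ref) for ref in references) / len(references)


def oracle_select(candidates, test_references):
    """Oracle: pick the candidate that scores best against the TEST image's
    own ground-truth captions, which are unavailable at inference time."""
    return max(candidates, key=lambda c: mean_score(c, test_references))


def consensus_select(candidates, neighbor_captions):
    """Consensus re-ranking: pick the candidate that scores best against
    captions of retrieved nearest-neighbor TRAINING images (assumed given)."""
    return max(candidates, key=lambda c: mean_score(c, neighbor_captions))


# Hypothetical example: the retrieved neighbors are a poor match for the image.
candidates = ["a dog runs on the beach", "a man rides a horse"]
test_references = ["a dog running along the beach"]
neighbor_captions = ["a man riding a horse on a field", "a man on a horse"]

oracle_choice = oracle_select(candidates, test_references)
consensus_choice = consensus_select(candidates, neighbor_captions)
print("oracle:   ", oracle_choice)
print("consensus:", consensus_choice)
```

In this toy case the two schemes pick different captions, because the consensus choice is driven entirely by which training images were retrieved. That is why the retrieval method, and the strength of the network behind it, matters for interpreting the consensus numbers.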