Reviews: Unsupervised Learning of Spoken Language with Visual Context

Neural Information Processing Systems 

This is interesting work that points in the right direction, but a few aspects of this paper are a bit problematic: 1) It would have been useful (or interesting) to use a corpus that has existing text captions, and either have users re-speak those captions or collect additional spoken captions. The data collection seems generally well thought-out, but why was the Places205 data set used? Prompted speech (such as that collected here) is not "spontaneous"; otherwise the WSJ recognizer would not have achieved 20% WER (this aspect is irrelevant for the purposes of this paper, though, I think). Typically, multiple captions are generated for a single image. Was this done here as well, or is there only a single caption for each image?