Reviews: Unsupervised Learning of Spoken Language with Visual Context

Neural Information Processing Systems 

This is interesting work that points in the right direction, but a few aspects of this paper are a bit problematic: 1) It would have been useful (or interesting) to use a corpus that has existing text captions, and either have users re-speak those captions or collect additional spoken captions. The data collection seems generally well thought-out, but why was the Places205 data set used? Prompted speech (such as that collected here) is not "spontaneous"; otherwise the WSJ recognizer would not have achieved 20% WER (this aspect is irrelevant for the purposes of this paper, though, I think). Typically, multiple captions are generated for a single image. Was this done here as well, or is there only a single caption for each image?