malinowski2015
Export Reviews, Discussions, Author Feedback and Meta-Reviews
Summary: This paper addresses the task of image-based Q&A on 2 axes: comparison of different models on 2 datasets and creation of a new dataset based on existing captions. Quality: The paper is addressing an important and interesting new topic which has seen recent surge of interest (Malinowski2014, Malinowski2015, Antol2015, Gao2015, etc.). The paper is technically sound, well-written, and well-organized. They achieve good results on both datasets and the baselines are useful to understand important ablations. The new dataset is also much larger than previous work, allowing training of stronger models, esp. However, there are several weaknesses: their main model is not very different from existing work on image-Q&A (Malinowski2015, who also had a VIS LSTM style model (but they were also jointly training the CNN and RNN, and also decoding with RNNs to produce longer answers) and achieves similar performance (except that adding bidirectionality and 2-way image input helps).