Reviews: Multimodal Learning and Reasoning for Visual Question Answering

Neural Information Processing Systems 

The paper introduces a novel modular neural network for multimodal tasks such as Visual Question Answering. The paper argues that a single visual representation is not sufficient for VQA and using some task specific visual features such as scene classification or object detection would result in a better VQA model. Following this motivation, the paper proposes a VQA model with modules tailored for specific tasks -- scene classification, object detection/classification, face detection/analysis -- and pushes the state-of-the-art performance. Strengths -- -- Since VQA spans many lower level vision tasks such as object detection, scene classification, etc., it makes a lot of sense that the visual features tailored for these tasks should help for the task of VQA. According to my knowledge, this is the first paper which explicitly uses this information in building their model, and shows the importance of visual features from each task in their ablation studies.