self-critical reasoning
Self-Critical Reasoning for Robust Visual Question Answering
Visual Question Answering (VQA) deep-learning systems tend to capture superficial statistical correlations in the training data because of strong language priors, and they fail to generalize to test data with a significantly different question-answer (QA) distribution. To address this issue, we introduce a self-critical training objective that ensures that visual explanations of correct answers match the most influential image regions more closely than those of other competitive answer candidates. The influential regions are determined either from human visual/textual explanations or automatically from just the significant words in the question and answer. We evaluate our approach on the VQA generalization task using the VQA-CP dataset, achieving a new state of the art, i.e., 49.5% using textual explanations and 48.5% using automatically
Reviews: Self-Critical Reasoning for Robust Visual Question Answering
Originality: The proposed work is inspired by an existing work, HINT (Selvaraju et al., arXiv 2019), which also proposes a novel training objective to align the model's gradient-based importance for the object proposals in an image with the regions identified as important by humans. This paper improves upon HINT in three ways: 1) instead of training the model to align its gradient-based importance with the regions identified as important by humans, the paper trains the model to strengthen its importance for the most influential region, i.e., the proposal that the model's gradient-based importance ranks highest among the set of regions identified as most important by humans; 2) in addition to using visual regions identified as important by humans, the paper also uses human-provided textual explanations and training QA pairs to identify important image regions; 3) the paper adds another term to the objective that criticizes incorrect predicted answers for being more sensitive to the influential region than the correct answers.
Quality: The paper does a good job of evaluating the proposed approach on both the VQA-CP and VQA datasets. The evaluation of the ablations of the proposed approach and of the false sensitivity rate is also useful.
Clarity: The paper is clear for the most part, except for the following: 1. Currently, in order to understand how the gradients from the proposed training objectives affect the model's parameters, one needs to read the HINT paper.
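The two ideas summarized in the review (picking the most influential region, then penalizing competing answers that are more sensitive to it than the correct answer) can be illustrated with a small sketch. This is only an illustration under stated assumptions, not the paper's implementation: the function name `self_critical_penalty`, the precomputed `sensitivity` table, and the plain hinge form of the penalty are all hypothetical, and in practice the sensitivities would come from the model's gradients with respect to region features.

```python
def self_critical_penalty(sensitivity, correct, candidates, competitors):
    """Toy version of a self-critical penalty (hypothetical helper).

    sensitivity[a][r] is an assumed, precomputed gradient-based sensitivity
    of answer a to image region r. candidates are the indices of regions
    marked as important (e.g. by human explanations).
    """
    # Most influential region: the candidate region to which the
    # correct answer is most sensitive.
    influential = max(candidates, key=lambda r: sensitivity[correct][r])
    s_correct = sensitivity[correct][influential]

    # Hinge penalty: a competing answer contributes only when it is more
    # sensitive to the influential region than the correct answer is.
    penalty = sum(
        max(0.0, sensitivity[a][influential] - s_correct)
        for a in competitors
        if a != correct
    )
    return influential, penalty


# Made-up numbers: three answers, three region proposals.
sensitivity = {
    "red": [0.1, 0.6, 0.2],
    "blue": [0.3, 0.9, 0.1],
    "green": [0.0, 0.4, 0.5],
}
region, penalty = self_critical_penalty(
    sensitivity, correct="red", candidates=[1, 2], competitors=["blue", "green"]
)
print(region, round(penalty, 3))
```

In this toy case only "blue" is penalized, because it is more sensitive than the correct answer "red" to the selected region; "green" is less sensitive and contributes nothing.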
Papers published at the Neural Information Processing Systems Conference.