Question Answering
Reviews: RUBi: Reducing Unimodal Biases for Visual Question Answering
Originality: The proposed method is a novel dynamic loss re-weighting technique applied to VQA under the changing-priors condition (VQA-CP), where the train and test sets are deliberately constructed to have different distributions. The related works are adequately cited and discussed. While prior works have also used knowledge from a question-only model to capture unnecessary biases in the dataset [25], the paper differs from [25] in some key aspects. For example, the proposed method guides the whole model (including the visual encoding branch) to learn "harder" examples better, whereas [25] focuses only on reducing bias from the question encoding. Quality: The proposed method is sound and well-motivated.
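To make the re-weighting idea concrete, here is a minimal sketch of how a question-only branch could modulate the loss of the full model. The review does not spell out the exact formulation, so the sigmoid mask and the function names (`masked_loss`, `cross_entropy`) are assumptions made for illustration, not the authors' implementation.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def cross_entropy(logits, target):
    # Negative log-likelihood of the target class.
    return -math.log(softmax(logits)[target])

def masked_loss(base_logits, q_only_logits, target):
    # Assumption: the question-only logits are squashed into a (0, 1) mask
    # that scales the full model's logits before the loss is computed.
    mask = [sigmoid(q) for q in q_only_logits]
    fused = [b * m for b, m in zip(base_logits, mask)]
    return cross_entropy(fused, target)

base = [2.0, 1.0, 0.0]          # full-model logits; class 0 is the correct answer
plain = cross_entropy(base, 0)  # loss without any re-weighting

# Question alone already points to the right answer: the example is "easy".
easy = masked_loss(base, [4.0, -4.0, -4.0], 0)
# Question alone points to a wrong answer: the example is "hard".
hard = masked_loss(base, [-4.0, 4.0, -4.0], 0)
```

Under these assumptions the loss shrinks on examples the question-only branch already answers correctly and grows on examples where the question prior is misleading, which is the behaviour the review describes as learning "harder" examples better.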
Reviews: RUBi: Reducing Unimodal Biases for Visual Question Answering
After the authors' rebuttal, all reviewers believe the paper makes a significant enough contribution to be accepted to the conference. When large amounts of data must be collected for complex tasks such as VQA, bias in the labeling process is highly likely. Techniques that improve robustness to such biases can have a significant impact in these cases. The authors should incorporate the clarifications and results from the rebuttal into the paper and address the reviewers' comments.
Reviews: Self-Critical Reasoning for Robust Visual Question Answering
Originality: The proposed work is inspired by an existing work, HINT (Selvaraju et al., arXiv 2019), which also proposes a novel training objective to align the model's gradient-based importance for various object proposals in the image with the regions identified as important by humans. This paper improves upon HINT in three ways: 1) instead of training the model to align its gradient-based importance with regions identified as important by humans, the paper trains the model to strengthen its importance for the most influential region, i.e., the proposal deemed most important by the model's gradient-based importance among the set of regions identified as most important by humans; 2) in addition to using visual regions identified as important by humans, the paper also uses textual explanations provided by humans and training QA pairs to identify important image regions; 3) the paper adds another term to the objective that criticizes incorrect predicted answers for being more sensitive to the influential region than correct answers. Quality: The paper does a good job of evaluating the proposed approach on both the VQA-CP and VQA datasets. The evaluation of the ablations of the proposed approach and of the false sensitivity rate is also useful. Clarity: The paper is clear for the most part, except for the following: 1. Currently, in order to understand how the gradients from the proposed training objectives affect the model's parameters, one needs to read the HINT paper.
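The third point above can be illustrated with a small sketch. The review does not give the exact form of the criticizing term, so the hinge-style penalty below and the name `self_critical_penalty` are assumptions; the sensitivities are taken as given numbers rather than computed from gradients.

```python
def self_critical_penalty(sensitivity, correct_idx, wrong_indices):
    # sensitivity[i]: how strongly answer i depends on the influential region
    # (in the paper this would come from gradient-based importance).
    # Assumption: penalize each wrong answer only by the amount its
    # sensitivity exceeds that of the correct answer.
    s_correct = sensitivity[correct_idx]
    return sum(max(0.0, sensitivity[i] - s_correct) for i in wrong_indices)

# Answer 1 (wrong) is more sensitive to the region than answer 0 (correct).
penalty = self_critical_penalty([0.5, 0.8, 0.2], correct_idx=0, wrong_indices=[1, 2])
# No wrong answer is more sensitive than the correct answer 1.
no_penalty = self_critical_penalty([0.5, 0.8, 0.2], correct_idx=1, wrong_indices=[0, 2])
```

The penalty is zero whenever the correct answer is at least as sensitive to the influential region as every competing answer, so only the violating cases contribute to the loss.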
Review for NeurIPS paper: Multimodal Graph Networks for Compositional Generalization in Visual Question Answering
Additional Feedback: * Adding more details about graph isomorphism networks and Sinkhorn normalization in the model section on page 4 would be useful. I'm wondering why the standard CLEVR questions are not used to measure that. I believe that as long as the newly introduced data doesn't provide or allow testing of new aspects or tasks, it's better to use common data for better comparability to prior approaches. In addition, the standard CLEVR questions allow more variability in answers and required reasoning skills than true/false statements, and are carefully constructed to mitigate shortcuts and biases, so they may be a better benchmark for the task of compositional reasoning. If so, when are the newly generated True/False statements discussed in the bottom part of page 5 used?
Review for NeurIPS paper: Multimodal Graph Networks for Compositional Generalization in Visual Question Answering
After the author response and discussion, all reviewers recommend (weak) accept of this paper for its contributions, including: - Significant improvements on the synthetic CLEVR/CLOSURE task - Overall novel and interesting method. I accept the paper with the expectation that the authors will improve and clarify the paper according to the author response and the suggestions by the reviewers, including the discussion of related work. The main concern of the reviewers and me is that the paper limits its experimental evaluation to the synthetic CLEVR dataset. The authors are strongly encouraged to include results on a non-synthetic dataset (e.g. VQA-CP, NLVR/2, GQA, or subsets if necessary) in the final version, even if this yields a negative result, which could then be analyzed by the authors.
RAMQA: A Unified Framework for Retrieval-Augmented Multi-Modal Question Answering
Bai, Yang, Grant, Christan Earl, Wang, Daisy Zhe
Multi-modal retrieval-augmented Question Answering (MRAQA), integrating text and images, has gained significant attention in information retrieval (IR) and natural language processing (NLP). Traditional ranking methods rely on small encoder-based language models, which are incompatible with modern decoder-based generative large language models (LLMs) that have advanced various NLP tasks. To bridge this gap, we propose RAMQA, a unified framework combining learning-to-rank methods with generative permutation-enhanced ranking techniques. We first train a pointwise multi-modal ranker using LLaVA as the backbone. Then, we apply instruction tuning to train a LLaMA model for re-ranking the top-k documents using an innovative autoregressive multi-task learning approach. Our generative ranking model generates re-ranked document IDs and specific answers from document candidates in various permutations. Experiments on two MRAQA benchmarks, WebQA and MultiModalQA, show significant improvements over strong baselines, highlighting the effectiveness of our approach. Code and data are available at: https://github.com/TonyBY/RAMQA
Reviews: Visual Question Answering with Question Representation Update (QRU)
Strength: The technical contributions are a clever and simple extension/combination of existing ideas such as "Neural Reasoner" [B. Show, attend and tell: Neural image caption generation with visual attention. The paper is well-written and easy to follow; in particular, the architecture of the model and the explanations for it are modular and simple (image understanding layer, question encoding layer, reasoning layer, and answering layer). I haven't yet encountered a VQA system that changes the question representation based on the image. This novelty adds strength to the paper.
Reviews: Hierarchical Question-Image Co-Attention for Visual Question Answering
The paper presents an incremental contribution with respect to previous methods for VQA that only exploit an image attention mechanism guided by question data. Here, the authors also consider a question attention mechanism guided by image information. In this sense, the main hypothesis of this work is that jointly considering visual and question attention mechanisms can improve the performance of current VQA systems. I agree that this hypothesis can be relevant for long questions, but I believe there is also a risk that question-based attention guided by image information can be misleading, in the sense that an image usually includes several information sources, while the question is more focused. In Figure 3, the authors include a graph that shows the impact of question length on performance. While this figure seems to show a tendency, the effect is still weak; a numerical analysis might help to support this point. I believe an analysis of potential differences (not only question length) between the most common errors of previous works (only image attention) and of the proposed approach (image and question attention) can help to support the relevance of the proposed attention mechanism.
Reviews: Learning to Reason with Third Order Tensor Products
Summary: This paper presents a question-answering system based on tensor product representations. Given a latent sentence encoding, different MLPs extract entity and relation representations, which are then used to update a tensor product representation of order 3; the system is trained end-to-end from the downstream signal of correctly answering the question. Experiments are limited to bAbI question answering, which is disappointing, as this is a synthetic corpus with a simple, known underlying triple structure. While the proposed system outperforms baselines like recurrent entity networks (RENs) by a small difference in mean error, RENs have also been applied to more real-world tasks such as the Children's Book Test (CBT). Strengths: I like that the authors do not just report the best performance of their model, but also the mean and variance from five runs.
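For readers unfamiliar with order-3 tensor product representations, the following toy sketch shows how an (entity, relation, entity) triple can be written into a third-order tensor and read back. It uses hand-picked one-hot vectors instead of MLP outputs, and the helper names (`zeros`, `bind`, `query`) are hypothetical; the update rule in the reviewed paper may well differ.

```python
def zeros(d):
    # A d x d x d tensor of zeros as nested lists.
    return [[[0.0] * d for _ in range(d)] for _ in range(d)]

def bind(T, e1, r, e2):
    # Write the triple by adding the outer product e1 (x) r (x) e2 to T.
    d = len(e1)
    for i in range(d):
        for j in range(d):
            for k in range(d):
                T[i][j][k] += e1[i] * r[j] * e2[k]

def query(T, e1, r):
    # Read out the second entity by contracting T with e1 and r.
    d = len(e1)
    return [
        sum(T[i][j][k] * e1[i] * r[j] for i in range(d) for j in range(d))
        for k in range(d)
    ]

# Toy vocabulary with orthonormal (one-hot) vectors, so retrieval is exact.
mary, kitchen, garden = [1, 0, 0], [0, 1, 0], [0, 0, 1]
is_in, likes = [1, 0, 0], [0, 1, 0]

T = zeros(3)
bind(T, mary, is_in, kitchen)   # "Mary is in the kitchen"
bind(T, mary, likes, garden)    # "Mary likes the garden"

where_is_mary = query(T, mary, is_in)
what_mary_likes = query(T, mary, likes)
```

With orthonormal vectors, each query returns exactly the vector that was bound. With learned, non-orthogonal representations the readout would be approximate, which is one reason a simple corpus with a known triple structure like bAbI is a favorable setting for this kind of model.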