

Review for NeurIPS paper: Multimodal Graph Networks for Compositional Generalization in Visual Question Answering

Neural Information Processing Systems

Additional Feedback: * Adding more details about graph isomorphism networks and Sinkhorn normalization to the model section on page 4 would be useful. I'm wondering why the standard CLEVR questions were not used to measure that. I believe that as long as the newly introduced data doesn't provide or allow testing of new aspects or tasks, it's better to use common data for better comparability with prior approaches. In addition, the standard CLEVR questions allow more variability in answers and required reasoning skills than true/false statements and are carefully constructed to mitigate shortcuts and biases, so they may be a better benchmark for the task of compositional reasoning. If so, when are the new True/False generated statements discussed in the bottom part of page 5 used?


Review for NeurIPS paper: Multimodal Graph Networks for Compositional Generalization in Visual Question Answering

Neural Information Processing Systems

After the author response and discussion, all reviewers recommend (weak) acceptance of this paper for its contributions, including:
- Significant improvements on the synthetic CLEVR/CLOSURE task
- Overall novel and interesting method

I accept the paper with the expectation that the authors will improve and clarify the paper according to the author response and the reviewers' suggestions, including the discussion of related work. The main concern of the reviewers and me is that the paper limits its experimental evaluation to the synthetic CLEVR dataset. The authors are strongly encouraged to include results on a non-synthetic dataset (e.g. VQA-CP, NLVR/2, GQA, or subsets if necessary) in the final version, even if this yields a negative result, which the authors could then analyze.


Multimodal Graph Networks for Compositional Generalization in Visual Question Answering

Neural Information Processing Systems

Compositional generalization is a key challenge in grounding natural language to visual perception. While deep learning models have achieved great success in multimodal tasks like visual question answering, recent studies have shown that they fail to generalize to new inputs that are simply an unseen combination of those seen in the training distribution. In this paper, we propose to tackle this challenge by employing neural factor graphs to induce a tighter coupling between concepts in different modalities (e.g. ...). Graph representations are inherently compositional in nature and allow us to capture entities, attributes and relations in a scalable manner. Our model first creates a multimodal graph, processes it with a graph neural network to induce a factor correspondence matrix, and then outputs a symbolic program to predict answers to questions.
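
The review above mentions Sinkhorn normalization, and the abstract describes inducing a correspondence matrix between concepts in the two modalities. As a rough illustration only, and not the authors' implementation, the sketch below shows how Sinkhorn normalization can turn a square matrix of raw similarity scores into an approximately doubly stochastic "soft matching" matrix. The function name `sinkhorn`, the iteration count, and the example scores are all assumptions made for this sketch.

```python
import math


def sinkhorn(scores, n_iters=50):
    """Alternately normalize rows and columns of exp(scores).

    Hypothetical helper for illustration; `scores` is a square list of lists.
    """
    # Exponentiate so every entry is strictly positive.
    m = [[math.exp(s) for s in row] for row in scores]
    for _ in range(n_iters):
        # Row step: make every row sum to 1.
        m = [[v / sum(row) for v in row] for row in m]
        # Column step: make every column sum to 1.
        col_sums = [sum(row[j] for row in m) for j in range(len(m[0]))]
        m = [[v / col_sums[j] for j, v in enumerate(row)] for row in m]
    return m


# Made-up similarity scores between three text nodes (rows)
# and three image nodes (columns).
scores = [
    [2.0, 0.1, 0.3],
    [0.2, 1.5, 0.1],
    [0.1, 0.4, 1.8],
]
soft_match = sinkhorn(scores)
for row in soft_match:
    print([round(v, 3) for v in row])
```

Each row of `soft_match` can be read as a soft assignment of one text node over the image nodes, with rows and columns both summing to (approximately) one. How the paper actually computes and uses its factor correspondence matrix is not specified in the excerpts above.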