Robust Visual Reasoning via Language Guided Neural Module Networks

Neural Information Processing Systems

Robust Visual Reasoning via Language Guided Neural Module Networks

Finally, we provide additional results and analysis to supplement Section 4.5 of the main paper. The neural modules take either two visual inputs (binary modules) or one visual input (unary modules). The original IEP-Ref implementation has a total of 60 distinct modules. Contrast Sets, consisting of samples that help expose model brittleness by probing a model's [...] In this section, we provide more results comparing the performance of our model with baselines. Specifically, we analyze the model's performance in terms of filtering the objects based on the [...] Results show that our approach significantly outperforms baselines.


Robust Visual Reasoning via Language Guided Neural Module Networks

Neural module networks (NMNs) are a popular approach for solving multi-modal tasks such as visual question answering (VQA) and visual referring expression recognition (REF). A key limitation of prior NMN implementations is that the neural modules do not effectively capture the association between the visual input and the relevant neighbourhood context of the textual input. For instance, NMNs fail to understand a new concept such as "yellow sphere to the left" even when it is a combination of known concepts from the training data: "blue sphere", "yellow cube", and "metallic cube to the left". In this paper, we address this limitation by introducing a language-guided adaptive convolution layer (LG-Conv) into NMNs, in which the filter weights of the convolutions are explicitly multiplied by a spatially varying language-guided kernel. Our model allows the neural module to adaptively co-attend over potential objects of interest from the visual and textual inputs.
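
To make the idea of multiplying convolution filter weights by a spatially varying language-guided kernel more concrete, here is a minimal pure-Python sketch. It is not the paper's implementation: the abstract does not give the exact form of the kernel, so the Gaussian similarity over language-gated features, the projection matrix `proj`, and the function name `lg_conv` are all assumptions made for illustration.

```python
import math


def lg_conv(feat, text, weight, proj):
    """Hypothetical language-guided adaptive convolution (illustrative only).

    feat[c][y][x]      visual feature map
    text[d]            text embedding
    weight[o][c][dy][dx]  ordinary convolution filter (odd kernel size)
    proj[c][d]         assumed projection from text space to visual channels
    Returns out[o][y][x] with zero padding and stride 1.
    """
    C, H, W = len(feat), len(feat[0]), len(feat[0][0])
    O, k = len(weight), len(weight[0][0])
    r = k // 2

    # Assumed language gate: one scalar per visual channel.
    gate = [sum(p * t for p, t in zip(row, text)) for row in proj]

    def at(c, y, x):
        # Zero padding outside the feature map.
        return feat[c][y][x] if 0 <= y < H and 0 <= x < W else 0.0

    out = [[[0.0] * W for _ in range(H)] for _ in range(O)]
    for y in range(H):
        for x in range(W):
            for dy in range(k):
                for dx in range(k):
                    ny, nx = y + dy - r, x + dx - r
                    # Assumed kernel: Gaussian similarity between the
                    # language-gated features at the centre and the neighbour.
                    d2 = sum(
                        (gate[c] * (at(c, ny, nx) - at(c, y, x))) ** 2
                        for c in range(C)
                    )
                    kern = math.exp(-0.5 * d2)
                    for o in range(O):
                        # The filter weight is multiplied by the kernel value.
                        out[o][y][x] += kern * sum(
                            weight[o][c][dy][dx] * at(c, ny, nx)
                            for c in range(C)
                        )
    return out


# Tiny usage example: one channel, a 1x2 feature map, a 3x3 filter.
feat = [[[1.0, 2.0]]]
weight = [[[[0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]]]]
proj = [[1.0]]
plain = lg_conv(feat, [0.0], weight, proj)   # zero text: kernel is 1 everywhere
guided = lg_conv(feat, [1.0], weight, proj)  # non-zero text: neighbours are down-weighted
```

With an all-zero text embedding the kernel equals 1 at every position, so under these assumptions the layer reduces to an ordinary zero-padded convolution; a non-zero embedding down-weights neighbours whose gated features differ from the centre location.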