Robust Visual Reasoning via Language Guided Neural Module Networks

Oct-10-2024, 14:38:19 GMT–Neural Information Processing Systems

Neural module networks (NMNs) are a popular approach for solving multi-modal tasks such as visual question answering (VQA) and visual referring expression recognition (REF). A key limitation of prior NMN implementations is that the neural modules do not effectively capture the association between the visual input and the relevant neighbourhood context of the textual input. For instance, NMNs fail to understand a new concept such as "yellow sphere to the left" even when it is a combination of concepts seen in the training data: "blue sphere", "yellow cube", and "metallic cube to the left". In this paper, we address this limitation by introducing a language-guided adaptive convolution layer (LG-Conv) into NMNs, in which the convolution filter weights are explicitly multiplied by a spatially varying language-guided kernel. Our model allows each neural module to adaptively co-attend over potential objects of interest from the visual and textual inputs.
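The abstract does not give the exact form of LG-Conv, so the following is only a minimal, dependency-free sketch of the general idea: a shared convolution filter whose weights are rescaled at every spatial location by a kernel that depends on the text. The function names `language_guidance` and `lg_conv`, the sigmoid guidance score, and the Gaussian form of the adaptive kernel are all assumptions made for illustration, not the authors' implementation.

```python
import math

def language_guidance(features, text_vec):
    """Hypothetical guidance map: sigmoid of the visual-text dot product
    at each spatial location (an assumption, not taken from the paper)."""
    return [
        [1.0 / (1.0 + math.exp(-sum(f * t for f, t in zip(cell, text_vec))))
         for cell in row]
        for row in features
    ]

def lg_conv(x, guide, weights):
    """Apply a k x k filter to a single-channel map x (zero padding).

    Each shared filter weight is multiplied by a spatially varying factor
    exp(-0.5 * (g_center - g_neighbour)^2), so neighbours whose language
    guidance differs from the centre contribute less. The Gaussian form is
    an assumed stand-in for the language-guided kernel.
    """
    h, w = len(x), len(x[0])
    r = len(weights) // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < h and 0 <= nj < w:
                        diff = guide[i][j] - guide[ni][nj]
                        adapt = math.exp(-0.5 * diff * diff)
                        acc += weights[di + r][dj + r] * adapt * x[ni][nj]
            out[i][j] = acc
    return out
```

When the guidance map is constant, every adaptive factor equals 1 and the layer reduces to an ordinary convolution; the text only changes the result where the guidance varies across a neighbourhood.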

language guided neural module network, nmn, robust visual reasoning

Technology:
- Information Technology > Artificial Intelligence > Vision (0.62)