Multimodal Graph Networks for Compositional Generalization in Visual Question Answering

Dec-23-2025, 20:22:46 GMT–Neural Information Processing Systems

Compositional generalization is a key challenge in grounding natural language to visual perception. While deep learning models have achieved great success in multimodal tasks like visual question answering, recent studies have shown that they fail to generalize to new inputs that are simply an unseen combination of those seen in the training distribution. In this paper, we propose to tackle this challenge by employing neural factor graphs to induce a tighter coupling between concepts in different modalities (e.g.

compositional generalization, multimodal graph network, name change, (3 more...)

Neural Information Processing Systems

Dec-23-2025, 20:22:46 GMT

Conferences Web Page

Add feedback

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language (0.64)
  - Machine Learning > Neural Networks
    - Deep Learning (0.60)