Reciprocal Attention Fusion for Visual Question Answering

May-11-2018–arXiv.org Artificial Intelligence

Existing attention mechanisms for Visual Question Answering (VQA) attend either to local image-grid features or to object-level features. Motivated by the observation that questions can relate to both object instances and their parts, we propose a novel attention mechanism that jointly considers the reciprocal relationships between these two levels of visual detail. The bottom-up attention thus generated is then combined with top-down information so that the model focuses only on the scene elements most relevant to a given question. Our design hierarchically fuses multi-modal information, i.e., language, object-level, and grid-level features, through an efficient tensor decomposition scheme. The proposed model improves state-of-the-art single-model performance from 67.9% to 68.2% on VQAv1 and from 65.3% to 67.4% on VQAv2, demonstrating a significant boost.
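The idea of question-guided attention over two levels of visual features can be illustrated with a small sketch. This is not the authors' implementation: it assumes a plain rank-R factorized bilinear score in place of the paper's tensor decomposition scheme, uses random toy features, and simply concatenates the two attended vectors. All names (`low_rank_score`, `attend`, `U_grid`, `W_obj`, and so on) and all dimensions are hypothetical and chosen only for illustration.

```python
import math
import random

def matvec_t(M, x):
    """Return M^T x, where M is a list of len(x) rows with R columns each."""
    cols = len(M[0])
    return [sum(M[i][k] * x[i] for i in range(len(x))) for k in range(cols)]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def low_rank_score(v, q, U, W):
    """Bilinear score v^T (U W^T) q, computed through the rank-R factors."""
    pv = matvec_t(U, v)  # project the visual feature to R dimensions
    pq = matvec_t(W, q)  # project the question feature to R dimensions
    return sum(a * b for a, b in zip(pv, pq))

def attend(features, q, U, W):
    """Weight each region by its relevance to q and pool the regions."""
    weights = softmax([low_rank_score(v, q, U, W) for v in features])
    dim = len(features[0])
    pooled = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return weights, pooled

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

# Toy setup (assumed sizes): 6 grid cells, 3 object proposals.
rng = random.Random(0)
DV, DQ, R = 4, 3, 2
question = [rng.uniform(-1, 1) for _ in range(DQ)]
grid_feats = random_matrix(6, DV, rng)
object_feats = random_matrix(3, DV, rng)
U_grid, W_grid = random_matrix(DV, R, rng), random_matrix(DQ, R, rng)
U_obj, W_obj = random_matrix(DV, R, rng), random_matrix(DQ, R, rng)

grid_w, grid_vec = attend(grid_feats, question, U_grid, W_grid)
obj_w, obj_vec = attend(object_feats, question, U_obj, W_obj)

# Simplification: concatenate the two attended vectors into one representation.
fused = grid_vec + obj_vec
```

The factorization keeps the parameter count at R(DV + DQ) per level instead of DV x DQ for a full bilinear matrix, which is the general motivation for decomposition-based fusion. A trained model would learn the factors rather than sample them at random.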

Country:
- Oceania > Australia (0.14)

Genre:
- Research Report (0.82)

Technology:
- Information Technology > Artificial Intelligence
  - Vision (1.00)
  - Natural Language > Question Answering (0.73)
  - Machine Learning
    - Neural Networks > Deep Learning (0.94)
    - Statistical Learning (0.88)