Interpretable Visual Question Answering by Visual Grounding from Attention Supervision Mining

Yundong Zhang, Juan Carlos Niebles, Alvaro Soto

Aug-1-2018–arXiv.org Artificial Intelligence

A key aspect of interpretable visual question answering (VQA) models is their ability to ground their answers in the relevant regions of the image. Current approaches with this capability rely on supervised learning and human-annotated groundings to train attention mechanisms inside the VQA architecture. Unfortunately, obtaining human annotations specifically for visual grounding is difficult and expensive. In this work, we demonstrate that a VQA architecture can be trained effectively with grounding supervision obtained automatically from available region descriptions and object annotations. We also show that our model, trained with this mined supervision, generates visual groundings that achieve a higher correlation with manually annotated groundings, while achieving state-of-the-art VQA accuracy.

machine learning, natural language, question answering

Country:
- South America > Chile (0.04)
- North America > United States
  - California > Santa Clara County > Palo Alto (0.04)

Genre:
- Research Report (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Vision (1.00)
  - Natural Language > Question Answering (0.53)
  - Machine Learning > Neural Networks
    - Deep Learning (0.47)
