SceneGATE: Scene-Graph based co-Attention networks for TExt visual question answering

Cao, Feiqi, Luo, Siwen, Nunez, Felipe, Wen, Zean, Poon, Josiah, Han, Caren

arXiv.org Artificial Intelligence 

Most TextVQA approaches focus on integrating objects, scene texts, and question words through a simple transformer encoder, but this fails to capture the semantic relations between the different modalities. This paper proposes a Scene Graph based co-Attention Network (SceneGATE) for TextVQA, which reveals the semantic relations among objects, Optical Character Recognition (OCR) tokens, and question words. This is achieved through a TextVQA-based scene graph that discovers the underlying semantics of an image. We created a guided-attention module to capture the intra-modal interplay between language and vision as guidance for inter-modal interactions. To teach the relations between the two modalities explicitly, we proposed and integrated two attention modules: a scene graph-based semantic relation-aware attention and a positional relation-aware attention. We conducted extensive experiments on two benchmark datasets, Text-VQA and ST-VQA, and show that SceneGATE outperforms existing methods thanks to the scene graph and its attention modules.
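For intuition, relation-aware attention of the kind the abstract describes is commonly implemented by biasing the attention logits with an embedding of the relation (semantic scene-graph predicate or spatial relation) between each token pair. Below is a minimal PyTorch sketch of that general idea, not the paper's exact formulation; the class name `RelationAwareAttention` and the integer `rel_ids` encoding of pairwise relations are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RelationAwareAttention(nn.Module):
    """Sketch: scaled dot-product attention whose logits are biased by a
    learned scalar per relation type between each token pair (e.g. a
    scene-graph predicate linking an object to an OCR token).
    Hypothetical illustration, not SceneGATE's actual module."""

    def __init__(self, dim: int, num_relations: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        # One learned bias per relation type (assumed encoding).
        self.rel_bias = nn.Embedding(num_relations, 1)
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor, rel_ids: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); rel_ids: (batch, seq, seq) relation labels.
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        logits = torch.matmul(q, k.transpose(-1, -2)) * self.scale
        # Add the pairwise relation bias to the attention logits.
        logits = logits + self.rel_bias(rel_ids).squeeze(-1)
        attn = F.softmax(logits, dim=-1)
        return torch.matmul(attn, v)


# Usage: 2 samples, 5 tokens, 64-dim features, 8 relation types.
x = torch.randn(2, 5, 64)
rel_ids = torch.randint(0, 8, (2, 5, 5))
out = RelationAwareAttention(dim=64, num_relations=8)(x, rel_ids)
print(out.shape)  # torch.Size([2, 5, 64])
```

A positional relation-aware variant would follow the same pattern, with `rel_ids` derived from the spatial layout of bounding boxes rather than from scene-graph predicates.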
