Disentangling Knowledge-based and Visual Reasoning by Question Decomposition in KB-VQA

Jun-26-2024–arXiv.org Artificial Intelligence

We study the Knowledge-Based visual question-answering problem, for which given a question, the models need to ground it into the visual modality to find the answer. Although many recent works use question-dependent captioners to verbalize the given image and use Large Language Models to solve the VQA problem, the research results show they are not reasonably performing for multi-hop questions. Our study shows that replacing a complex question with several simpler questions helps to extract more relevant information from the image and provide a stronger comprehension of it. Moreover, we analyze the decomposed questions to find out the modality of the information that is required to answer them and use a captioner for the visual questions and LLMs as a general knowledge source for the non-visual KB-based questions. Our results demonstrate the positive impact of using simple questions before retrieving visual or non-visual information. We have provided results and analysis on three well-known VQA datasets including OKVQA, A-OKVQA, and KRVQA, and achieved up to 2% improvement in accuracy.

captioner, information, knowledge, (13 more...)

arXiv.org Artificial Intelligence

Jun-26-2024

arXiv.org PDF

Add feedback

Country:
- Asia > Japan (0.04)
- Europe > Greece (0.04)
- North America > United States
  - Michigan (0.04)

Genre:
- Research Report > New Finding (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning > Expert Systems (1.00)
  - Natural Language > Large Language Model (0.91)
  - Machine Learning > Neural Networks
    - Deep Learning (0.48)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found