Integrating Object Interaction Self-Attention and GAN-Based Debiasing for Visual Question Answering

Li, Zhifei, Qiu, Feng, Wang, Yiran, Xia, Yujing, Xiao, Kui, Zhang, Miao, Zhang, Yan

Sep-26-2025–arXiv.org Artificial Intelligence

Abstract: Visual Question Answering (VQA) presents a unique challenge by requiring models to understand and reason about visual content to answer questions accurately. Existing VQA models often struggle with biases introduced by the training data, leading to over-reliance on superficial patterns and inadequate generalization to diverse questions and images. This paper presents a novel model, IOG-VQA, which integrates Object Interaction Self-Attention and GAN-Based Debiasing to enhance VQA model performance. The self-attention mechanism allows our model to capture complex interactions between objects within an image, providing a more comprehensive understanding of the visual context. Meanwhile, the GAN-based debiasing framework generates unbiased data distributions, helping the model learn more robust and generalizable features. By leveraging these two components, IOG-VQA effectively combines visual and textual information to address the inherent biases in VQA datasets. Extensive experiments on the VQA-CP v1 and VQA-CP v2 datasets demonstrate that our model shows excellent performance compared with existing methods, particularly in handling biased and imbalanced data distributions. These results highlight the importance of addressing both object interactions and dataset biases in advancing VQA tasks. Our code is available at https://github.com/HubuKG/IOG-VQA.

Visual Question Answering (VQA) [1] is an interdisciplinary field that combines the challenges of computer vision and natural language processing to generate accurate answers to questions about images. This task requires a deep understanding of both the visual content and the contextual nuances posed by the questions, making it a complex and demanding research area. Despite significant advancements in recent years, current VQA models often struggle with biases introduced by training data [2], [3], [4], leading to an over-reliance on superficial patterns and correlations rather than genuine visual reasoning and understanding.
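
As a rough illustration of what self-attention over detected objects means in general, the sketch below applies single-head scaled dot-product attention to a small set of object feature vectors, so that each object's output is a weighted mix of all objects in the image. This is not the IOG-VQA implementation: the function names, the random projection matrices, and the tiny dimensions are all assumptions chosen for readability, and the GAN-based debiasing component is not shown.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def object_self_attention(objects, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention (hypothetical helper).

    objects: one feature vector per detected object, shape (n, d_in).
    w_q, w_k, w_v: projection matrices, shape (d_in, d_model).
    Returns the attended features (n, d_model) and the attention weights (n, n).
    """
    q = matmul(objects, w_q)
    k = matmul(objects, w_k)
    v = matmul(objects, w_v)
    scale = math.sqrt(len(k[0]))
    k_t = [list(col) for col in zip(*k)]  # transpose of k, shape (d_model, n)
    scores = matmul(q, k_t)               # pairwise object-to-object scores
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights


# Toy example: 4 objects with 8-dimensional features projected to 3 dimensions.
# All values are random placeholders, not real detector outputs.
random.seed(0)
n, d_in, d_model = 4, 8, 3
objects = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(n)]
w_q, w_k, w_v = (
    [[random.gauss(0, 1) / math.sqrt(d_in) for _ in range(d_model)] for _ in range(d_in)]
    for _ in range(3)
)
attended, weights = object_self_attention(objects, w_q, w_k, w_v)
```

Each row of `weights` sums to one and indicates how strongly one object attends to every other object; a full model would typically use learned projections, multiple heads, and fusion with question features.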



