Language-Bias-Resilient Visual Question Answering via Adaptive Multi-Margin Collaborative Debiasing
Neural Information Processing Systems
Language bias in Visual Question Answering (VQA) arises when models exploit spurious statistical correlations between question templates and answers, particularly in out-of-distribution scenarios, thereby neglecting essential visual cues and compromising genuine multimodal reasoning. Despite numerous efforts to enhance the robustness of VQA models, a principled understanding of how such bias originates and influences model behavior remains underdeveloped. In this paper, we address this gap through a comprehensive empirical and theoretical analysis, revealing that modality-specific gradient imbalances, which originate from the inherent heterogeneity of multimodal data, lead to skewed feature fusion and biased classifier weights. To alleviate these issues, we propose a novel Multi-Margin Collaborative Debiasing (MMCD) framework, which adaptively integrates frequency-aware, confidence-aware, and difficulty-aware angular margins with a dynamic, difficulty-aware contrastive learning mechanism to reshape decision boundaries under biased training conditions. Extensive experiments across multiple challenging VQA benchmarks confirm the consistent superiority of our proposed MMCD over state-of-the-art baselines in combating language bias.
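The abstract does not give the form of the margins, so the following is only a minimal sketch of the general idea behind a frequency-aware angular margin, not the MMCD loss itself. It assumes an additive angular margin on the target class (cos θ replaced by cos(θ + m)) and a hypothetical rule under which rarer answers receive larger margins. The function names, the `max_margin` and `scale` parameters, and the toy numbers are all illustrative assumptions.

```python
import math

def frequency_margins(answer_counts, max_margin=0.5):
    """Hypothetical rule (an assumption): rarer answers get a larger margin, in radians."""
    top = max(answer_counts)
    return [max_margin * (1.0 - count / top) for count in answer_counts]

def margin_logits(cosines, target, margins, scale=16.0):
    """Additive angular margin: use cos(theta + m) instead of cos(theta) for the target class."""
    logits = []
    for j, cos_t in enumerate(cosines):
        if j == target:
            theta = math.acos(max(-1.0, min(1.0, cos_t)))
            # Clamp at pi so the penalised cosine never starts increasing again.
            cos_t = math.cos(min(math.pi, theta + margins[j]))
        logits.append(scale * cos_t)
    return logits

def cross_entropy(logits, target):
    """Numerically stable softmax cross-entropy for a single example."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_z - logits[target]

# Toy setup: three candidate answers with skewed training frequencies.
counts = [900, 90, 10]
margins = frequency_margins(counts)

# Cosine similarities between a fused feature and each answer's classifier weight.
cosines = [0.6, 0.5, 0.7]
rare_answer = 2

loss_plain = cross_entropy(margin_logits(cosines, rare_answer, [0.0] * 3), rare_answer)
loss_margin = cross_entropy(margin_logits(cosines, rare_answer, margins), rare_answer)
print(margins, loss_plain, loss_margin)
```

With the margin applied, the rare answer must be separated by a wider angle before its loss becomes small, which is one way a margin can reshape decision boundaries under skewed answer frequencies. The confidence-aware and difficulty-aware margins and the contrastive term named in the abstract are not modelled here.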
Jun-22-2026, 05:26:47 GMT
- Genre:
  - Research Report > Experimental Study (1.00)
- Technology:
  - Information Technology > Artificial Intelligence
    - Vision (1.00)
    - Natural Language > Question Answering (0.62)
    - Machine Learning
      - Statistical Learning (0.48)
      - Neural Networks (0.46)