Plotting

 Chai, Qi


FortisAVQA and MAVEN: a Benchmark Dataset and Debiasing Framework for Robust Multimodal Reasoning

arXiv.org Artificial Intelligence

Audio-Visual Question Answering (AVQA) is a challenging multimodal reasoning task that requires intelligent systems to accurately answer natural language queries based on paired audio-video inputs. However, existing AVQA approaches often overfit to dataset biases, leading to poor robustness. To address these challenges, we first introduce a novel dataset, FortisAVQA, constructed in two stages: (1) rephrasing questions in the test split of the public MUSIC-AVQA dataset and (2) introducing distribution shifts across questions. The first stage expands the test space with greater diversity, while the second enables a refined robustness evaluation across rare, frequent, and overall question distributions. Second, we introduce a robust Multimodal Audio-Visual Epistemic Network (MAVEN) that leverages a multifaceted cycle collaborative debiasing strategy to mitigate bias learning. Experimental results demonstrate that our architecture achieves state-of-the-art performance on FortisAVQA, with a notable improvement of 7.81%. Additionally, our evaluation reveals the limited robustness of existing multimodal QA methods. We also verify the plug-and-play capability of our strategy by integrating it with various baseline models across both datasets.

Humans possess the extraordinary capacity to seamlessly integrate auditory and visual cues, effectively establishing a cohesive relationship between visual and auditory stimuli [1-3].

Jie Ma, Pinghui Wang, Jing Tao and Zhou Su are with the Ministry of Education Key Laboratory for Intelligent Networks and Network Security, School of Cyber Science and Engineering, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China. Zhitao Gao and Jun Liu are with the Shaanxi Provincial Key Laboratory of Big Data Knowledge Engineering, School of Computer Science and Technology, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China.
Qi Chai is with the Information Hub, Hong Kong University of Science and Technology (Guangzhou), Guangzhou, Guangdong, 510000, China.

The questions in current AVQA datasets are generated from a limited set of predefined templates, which may not reflect real-world scenarios. Our findings indicate that existing methods such as STG [6] are not robust, which may be attributed to excessive bias learning, such as memorizing statistical regularities between critical question words and answers. AVQA requires the system to learn high-order interaction representations of the concepts encompassed by the audio, video, and language modalities. As is well known [8-10], the high-level reasoning ability of a system relies mainly on large-scale data that does not contain harmful biases or statistical regularities. However, completely avoiding negative bias in datasets seems challenging [11] due to the inherent skewness of real-world data distributions.
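The evaluation over rare, frequent, and overall question distributions mentioned in the abstract can be illustrated with a small sketch. This is not the FortisAVQA protocol itself: the function names are hypothetical, and the rule used here to separate frequent ("head") from rare ("tail") answers, a count above 1.2 times the mean answer count, is an assumption chosen only for illustration.

```python
from collections import Counter


def split_head_tail(answers, ratio=1.2):
    """Label each distinct answer as 'head' (frequent) or 'tail' (rare).

    Assumption: an answer is 'head' when its count exceeds `ratio` times
    the mean count over all distinct answers. The source does not state
    the actual threshold used by the dataset.
    """
    counts = Counter(answers)
    mean_count = sum(counts.values()) / len(counts)
    return {a: ("head" if c > ratio * mean_count else "tail")
            for a, c in counts.items()}


def accuracy_by_split(golds, preds, ratio=1.2):
    """Return accuracy on the head, tail, and overall subsets."""
    labels = split_head_tail(golds, ratio)
    hits = {"head": 0, "tail": 0, "overall": 0}
    totals = {"head": 0, "tail": 0, "overall": 0}
    for gold, pred in zip(golds, preds):
        # Every sample counts toward its own split and toward 'overall'.
        for key in (labels[gold], "overall"):
            totals[key] += 1
            hits[key] += int(gold == pred)
    # None marks a split that contains no samples.
    return {k: (hits[k] / totals[k] if totals[k] else None) for k in hits}


# A model that always predicts the most frequent answer looks strong
# on the head split and fails completely on the tail split.
golds = ["two"] * 6 + ["three"] * 2 + ["four"]
preds = ["two"] * len(golds)
result = accuracy_by_split(golds, preds)
print(result)
```

Reporting the three numbers side by side makes it visible when a high overall score is driven mostly by memorized frequent answers.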


MoCA: Incorporating Multi-stage Domain Pretraining and Cross-guided Multimodal Attention for Textbook Question Answering

arXiv.org Artificial Intelligence

Textbook Question Answering (TQA) is a complex multimodal task that requires inferring answers from long context descriptions and abundant diagrams. Compared with Visual Question Answering (VQA), TQA contains a large amount of uncommon terminology and varied diagram inputs. This poses new challenges to the capability of language models to represent domain-specific spans, and it pushes multimodal fusion to a more complex level. To tackle these issues, we propose a novel model named MoCA, which incorporates multi-stage domain pretraining and multimodal cross attention for the TQA task. First, we introduce a multi-stage domain pretraining module that conducts unsupervised post-pretraining with the span mask strategy, followed by supervised pre-finetuning. For domain post-pretraining in particular, we propose a heuristic generation algorithm to employ the terminology corpus. Second, to fully consider the rich inputs of context and diagrams, we propose cross-guided multimodal attention to update the features of the text, question diagram, and instructional diagram based on a progressive strategy. Further, a dual gating mechanism is adopted to improve the model ensemble. The experimental results show the superiority of our model, which outperforms the state-of-the-art methods by 2.21% and 2.43% on the validation and test splits, respectively.
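As a rough illustration of a span mask strategy, the sketch below masks contiguous runs of tokens rather than isolated ones. It is a generic random span masker, not MoCA's heuristic generation algorithm, which the abstract says draws on a terminology corpus. The function name, the `[MASK]` token, and the parameters `mask_ratio` and `max_span` are all assumptions introduced for this example.

```python
import random


def mask_spans(tokens, mask_ratio=0.15, max_span=3,
               mask_token="[MASK]", seed=None):
    """Mask random contiguous spans until about `mask_ratio` of tokens are hidden.

    Returns the masked token list and the sorted masked positions.
    Assumption: spans are sampled uniformly at random; a terminology-aware
    variant would pick span boundaries from a domain term list instead.
    """
    rng = random.Random(seed)
    n = len(tokens)
    # Number of tokens to mask, clamped to the sequence length.
    budget = min(n, max(1, round(n * mask_ratio))) if n else 0
    masked = list(tokens)
    covered = set()
    attempts = 0
    # Cap the attempts so the loop always terminates.
    while len(covered) < budget and attempts < 10 * n:
        attempts += 1
        length = min(rng.randint(1, max_span), budget - len(covered))
        start = rng.randrange(0, n - length + 1)
        span = range(start, start + length)
        if any(i in covered for i in span):
            continue  # skip spans that overlap an earlier one
        covered.update(span)
        for i in span:
            masked[i] = mask_token
    return masked, sorted(covered)


tokens = "the mitochondria is the powerhouse of the cell and produces ATP".split()
masked, positions = mask_spans(tokens, mask_ratio=0.3, seed=0)
print(masked, positions)
```

The model is then trained to recover the original tokens at the returned positions; masking whole spans encourages it to represent multi-word terms rather than single words.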