mChartQA: A universal benchmark for multimodal Chart Question Answer based on Vision-Language Alignment and Reasoning

Jingxuan Wei, Nan Xu, Guiyong Chang, Yin Luo, BiHui Yu, Ruifeng Guo

arXiv.org Artificial Intelligence 

The goal of multimodal chart question answering is to automatically answer a natural language question about a chart in order to facilitate visual data analysis (Hoque et al., 2022), a setting in which the ability to understand and interact with visual data is essential (Masry et al., 2022). The task has emerged as a crucial intersection of computer vision and natural language processing, addressing the growing demand for intelligent systems capable of interpreting the complex visual data in charts (Masry et al., 2022). Beyond its general applications, multimodal chart question answering plays a pivotal role in sectors that require precise and rapid analysis of visual data. In the financial domain, it is indispensable for tasks such as financial report analysis (Wang et al., 2023a), decision support (Kafle et al., 2020), invoice parsing (Gerling and Lessmann, 2023), and contract review (Jie et al., 2023). Similarly, in the medical field, it contributes significantly to the digitization of patient records (Xu et al., 2021), medical insurance review (Meskó, 2023), diagnostic assistance (Othmani and Zeghina, 2022), and quality control of medical records (Schilcher et al., 2024). Because of the richness and ambiguity of natural language and the complexity of visual reasoning, the multimodal chart question answering task requires predicting answers at the intersection of information visualization, natural language processing, and human-computer interaction (Hoque et al., 2022). Early approaches applied natural language processing techniques that depended largely on heuristics or grammar-based parsing (Setlur et al., 2016; Srinivasan and Stasko, 2017; Hoque et al., 2017; Gao et al., 2015). Because these methods handled complex linguistic phenomena poorly, relied too heavily on grammatical rules, and offered only limited depth of natural language understanding, deep learning models were subsequently introduced for understanding natural language queries about visualizations (Chaudhry et al., 2020; Singh and Shekhar, 2020; Reddy et al., 2019).
