Question Answering
mChartQA: A universal benchmark for multimodal Chart Question Answer based on Vision-Language Alignment and Reasoning
Wei, Jingxuan, Xu, Nan, Chang, Guiyong, Luo, Yin, Yu, BiHui, Guo, Ruifeng
The goal of multimodal chart question answering is to automatically answer a natural language question about a chart to facilitate visual data analysis (Hoque et al., 2022), where the ability to understand and interact with visual data is essential (Masry et al., 2022). It has emerged as a crucial intersection of computer vision and natural language processing, addressing the growing demand for intelligent systems capable of interpreting complex visual data in charts (Masry et al., 2022). Beyond its general applications, multimodal chart question answering plays a pivotal role in sectors requiring precise and rapid analysis of visual data. In the financial domain, it is indispensable for tasks such as financial report analysis (Wang et al., 2023a), decision support (Kafle et al., 2020), invoice parsing (Gerling and Lessmann, 2023), and contract review (Jie et al., 2023). Similarly, in the medical field, it significantly contributes to the digitization of patient records (Xu et al., 2021), medical insurance review (Meskó, 2023), diagnostic assistance (Othmani and Zeghina, 2022), and quality control (Schilcher et al., 2024) of medical records. Due to the richness and ambiguities of natural language and the complexity of visual reasoning, the multimodal chart question answering task requires predicting the answer at the intersection of information visualization, natural language processing, and human-computer interaction (Hoque et al., 2022). Early approaches apply natural language processing techniques that depend largely on heuristics or grammar-based parsing (Setlur et al., 2016; Srinivasan and Stasko, 2017; Hoque et al., 2017; Gao et al., 2015).
Because these early approaches process complex linguistic phenomena insufficiently, rely too heavily on grammatical rules, and understand natural language only to a limited depth, deep learning models have been introduced for understanding natural language queries about visualizations (Chaudhry et al., 2020; Singh and Shekhar, 2020; Reddy et al., 2019).
Explainable Multi-hop Question Generation: An End-to-End Approach without Intermediate Question Labeling
Hwang, Seonjeong, Kim, Yunsu, Lee, Gary Geunbae
In response to the increasing use of interactive artificial intelligence, the demand for the capacity to handle complex questions has increased. Multi-hop question generation aims to generate complex questions that require multi-step reasoning over several documents. Previous studies have predominantly utilized end-to-end models, wherein questions are decoded based on the representation of context documents. However, these approaches lack the ability to explain the reasoning process behind the generated multi-hop questions. Additionally, the question rewriting approach, which incrementally increases the question complexity, is limited by the need to label data for intermediate-stage questions. In this paper, we introduce an end-to-end question rewriting model that increases question complexity through sequential rewriting. The proposed model has the advantage of training with only the final multi-hop questions, without intermediate questions. Experimental results demonstrate the effectiveness of our model in generating complex questions, particularly 3- and 4-hop questions, which are appropriately paired with input answers. We also show that our model logically and incrementally increases the complexity of questions, and that the generated multi-hop questions are beneficial for training question answering models.
How Robust are the Tabular QA Models for Scientific Tables? A Study using Customized Dataset
Ghosh, Akash, Sahith, B Venkata, Ganguly, Niloy, Goyal, Pawan, Singh, Mayank
Question-answering (QA) on hybrid scientific tabular and textual data deals with scientific information and relies on complex numerical reasoning. In recent years, while tabular QA has seen rapid progress, an understanding of the robustness of tabular QA models on scientific information is lacking due to the absence of any benchmark dataset. To investigate the robustness of the existing state-of-the-art QA models on scientific hybrid tabular data, we propose a new dataset, "SciTabQA", consisting of 822 question-answer pairs from scientific tables and their descriptions. With the help of this dataset, we assess the state-of-the-art Tabular QA models based on their ability (i) to use heterogeneous information requiring both structured data (table) and unstructured data (text) and (ii) to perform complex scientific reasoning tasks. In essence, we check the capability of the models to interpret scientific tables and text. Our experiments show that "SciTabQA" is an innovative dataset to study question-answering over scientific heterogeneous data. We benchmark three state-of-the-art Tabular QA models, and find that the best F1 score is only 0.462.
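The abstract above reports an F1 score without stating which variant is used. A common choice in QA evaluation is token-overlap F1 between a predicted and a gold answer; the sketch below illustrates that variant only. The whitespace tokenization, lowercasing, and the function name `token_f1` are assumptions for illustration, not the SciTabQA evaluation script.

```python
from collections import Counter


def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted and a gold answer string.

    Assumption: answers are lowercased and split on whitespace.
    """
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()

    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)

    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

A dataset-level score under this assumption would be the mean of `token_f1` over all question-answer pairs.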
Jetsons at FinNLP 2024: Towards Understanding the ESG Impact of a News Article using Transformer-based Models
Dakle, Parag Pravin, Gon, Alolika, Zha, Sihan, Wang, Liang, Rallabandi, SaiKrishna, Raghavan, Preethi
In this paper, we describe the different approaches explored by the Jetsons team for the Multi-Lingual ESG Impact Duration Inference (ML-ESG-3) shared task. The shared task focuses on predicting the duration and type of the ESG impact of a news article. The shared task dataset consists of 2,059 news titles and articles in English, French, Korean, and Japanese. For the impact duration classification task, we fine-tuned XLM-RoBERTa with a custom fine-tuning strategy and self-training, and DeBERTa-v3 using only English translations. These models individually ranked first on the leaderboard for Korean and Japanese and in an ensemble for the English language, respectively. For the impact type classification task, our XLM-RoBERTa model fine-tuned using a custom fine-tuning strategy ranked first for the English language.
Multi-hop Question Answering under Temporal Knowledge Editing
Cheng, Keyuan, Lin, Gang, Fei, Haoyang, Zhai, Yuxuan, Yu, Lu, Ali, Muhammad Asif, Hu, Lijie, Wang, Di
Multi-hop question answering (MQA) under knowledge editing (KE) has garnered significant attention in the era of large language models. However, existing models for MQA under KE exhibit poor performance when dealing with questions containing explicit temporal contexts. To address this limitation, we propose a novel framework, namely TEMPoral knowLEdge augmented Multi-hop Question Answering (TEMPLE-MQA). Unlike previous methods, TEMPLE-MQA first constructs a time-aware graph (TAG) to store edit knowledge in a structured manner. Then, through our proposed inference path, structural retrieval, and joint reasoning stages, TEMPLE-MQA effectively discerns temporal contexts within the question query. Experiments on benchmark datasets demonstrate that TEMPLE-MQA significantly outperforms baseline models. Additionally, we contribute a new dataset, namely TKEMQA, which serves as the inaugural benchmark tailored specifically for MQA with temporal scopes.
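The idea of storing edited knowledge with temporal scopes can be illustrated with a small sketch. This is not the TEMPLE-MQA implementation: the class names `TimedEdge` and `TimeAwareGraph`, the year-granularity validity intervals, and the placeholder entities are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TimedEdge:
    obj: str             # object entity of the edited fact
    start: int           # first year the fact holds (inclusive)
    end: Optional[int]   # last year the fact holds (inclusive); None = open-ended


class TimeAwareGraph:
    """Hypothetical store mapping (subject, relation) to time-scoped objects."""

    def __init__(self) -> None:
        self._edges = defaultdict(list)

    def add_edit(self, subj: str, rel: str, obj: str,
                 start: int, end: Optional[int] = None) -> None:
        self._edges[(subj, rel)].append(TimedEdge(obj, start, end))

    def lookup(self, subj: str, rel: str, year: int) -> Optional[str]:
        # Return the first stored object whose validity interval covers `year`.
        for edge in self._edges.get((subj, rel), []):
            if edge.start <= year and (edge.end is None or year <= edge.end):
                return edge.obj
        return None


# Placeholder entities, not real-world facts.
graph = TimeAwareGraph()
graph.add_edit("CompanyX", "ceo", "Alice", 2010, 2015)
graph.add_edit("CompanyX", "ceo", "Bob", 2016)
```

With such a structure, a question carrying an explicit temporal context (for example, "in 2012") resolves to the edit whose interval contains that time rather than to the most recent edit.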
DOCMASTER: A Unified Platform for Annotation, Training, & Inference in Document Question-Answering
Nguyen, Alex, Wang, Zilong, Shang, Jingbo, Mekala, Dheeraj
The application of natural language processing models to PDF documents is pivotal for various business applications yet the challenge of training models for this purpose persists in businesses due to specific hurdles. These include the complexity of working with PDF formats that necessitate parsing text and layout information for curating training data and the lack of privacy-preserving annotation tools. This paper introduces DOCMASTER, a unified platform designed for annotating PDF documents, model training, and inference, tailored to document question-answering. The annotation interface enables users to input questions and highlight text spans within the PDF file as answers, saving layout information and text spans accordingly. Furthermore, DOCMASTER supports both state-of-the-art layout-aware and text models for comprehensive training purposes. Importantly, as annotations, training, and inference occur on-device, it also safeguards privacy. The platform has been instrumental in driving several research prototypes concerning document analysis such as the AI assistant utilized by University of California San Diego's (UCSD) International Services and Engagement Office (ISEO) for processing a substantial volume of PDF documents.
JDocQA: Japanese Document Question Answering Dataset for Generative Language Models
Onami, Eri, Kurita, Shuhei, Miyanishi, Taiki, Watanabe, Taro
Document question answering is a task of question answering on given documents such as reports, slides, pamphlets, and websites, and it is a truly demanding task as paper and electronic forms of documents are so common in our society. It is known to be quite challenging because it requires not only text understanding but also understanding of figures and tables, and hence visual question answering (VQA) methods are often examined in addition to textual approaches. We introduce Japanese Document Question Answering (JDocQA), a large-scale document-based QA dataset, essentially requiring both visual and textual information to answer questions, which comprises 5,504 documents in PDF format and 11,600 annotated question-and-answer instances in Japanese. Each QA instance includes references to the document pages and bounding boxes for the answer clues. We incorporate multiple categories of questions and unanswerable questions from the document for realistic question-answering applications. We empirically evaluate the effectiveness of our dataset with text-based large language models (LLMs) and multimodal models. Incorporating unanswerable questions in finetuning may contribute to harnessing the so-called hallucination generation.
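A QA instance with page references, bounding boxes for answer clues, and an unanswerable case could be represented roughly as follows. The field names, the `(x0, y0, x1, y1)` box convention, and the use of `None` for unanswerable questions are assumptions for illustration; the released JDocQA schema may differ.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Assumed convention: (x0, y0, x1, y1) in page coordinates.
BoundingBox = Tuple[float, float, float, float]


@dataclass
class AnswerClue:
    page: int          # page of the PDF where the clue appears
    bbox: BoundingBox  # region on that page containing the clue


@dataclass
class QAInstance:
    document_id: str
    question: str
    answer: Optional[str]  # None marks an unanswerable question (assumption)
    clues: List[AnswerClue] = field(default_factory=list)

    @property
    def is_answerable(self) -> bool:
        return self.answer is not None


# Placeholder content, not taken from the dataset.
answerable = QAInstance(
    document_id="doc-0001",
    question="What is the total shown in the table?",
    answer="42",
    clues=[AnswerClue(page=3, bbox=(10.0, 20.0, 110.0, 60.0))],
)
unanswerable = QAInstance(
    document_id="doc-0001",
    question="Who signed the report?",
    answer=None,
)
```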
Boosting Conversational Question Answering with Fine-Grained Retrieval-Augmentation and Self-Check
Ye, Linhao, Lei, Zhikai, Yin, Jianghao, Chen, Qin, Zhou, Jie, He, Liang
Retrieval-Augmented Generation (RAG) aims to generate more reliable and accurate responses, by augmenting large language models (LLMs) with the external vast and dynamic knowledge. Most previous work focuses on using RAG for single-round question answering, while how to adapt RAG to the complex conversational setting wherein the question is interdependent on the preceding context is not well studied. In this paper, we propose a conversation-level RAG (ConvRAG) approach, which incorporates fine-grained retrieval augmentation and self-check for conversational question answering (CQA). In particular, our approach consists of three components, namely conversational question refiner, fine-grained retriever and self-check based response generator, which work collaboratively for question understanding and relevant information acquisition in conversational settings. Extensive experiments demonstrate the ...
Conversational Question Answering (CQA) has attracted great attention in both academia and industry in recent years, which provides more natural human-computer interactions by extending single-turn question answering (QA) to conversational settings [23, 33]. In CQA, users usually ask multiple follow-up questions using anaphora that refers to certain concepts in previous conversation history, or ellipsis that can be omitted. As shown in Figure 1, the 'battle' in the current question refers to 'Hunayn' in the first turn, making it more challenging than single-turn QA. One key challenge in CQA is how to explicitly represent the questions based on the interdependent context. Previous work focuses on using the question rewriting methods for a better question understanding. Elgohary et al. [11] first released a dataset with human rewrites of questions and analysed the writing quality.
ArabicaQA: A Comprehensive Dataset for Arabic Question Answering
Abdallah, Abdelrahman, Kasem, Mahmoud, Abdalla, Mahmoud, Mahmoud, Mohamed, Elkasaby, Mohamed, Elbendary, Yasser, Jatowt, Adam
In this paper, we address the significant gap in Arabic natural language processing (NLP) resources by introducing ArabicaQA, the first large-scale dataset for machine reading comprehension and open-domain question answering in Arabic. This comprehensive dataset, consisting of 89,095 answerable and 3,701 unanswerable questions created by crowdworkers to look similar to answerable ones, along with additional labels of open-domain questions, marks a crucial advancement in Arabic NLP resources. We also present AraDPR, the first dense passage retrieval model trained on the Arabic Wikipedia corpus, specifically designed to tackle the unique challenges of Arabic text retrieval. Furthermore, our study includes extensive benchmarking of large language models (LLMs) for Arabic question answering, critically evaluating their performance in the Arabic language context. In conclusion, ArabicaQA, AraDPR, and the benchmarking of LLMs in Arabic question answering offer significant advancements in the field of Arabic NLP. The dataset and code are publicly accessible for further research: https://github.com/DataScienceUIBK/ArabicaQA.
Denoising Table-Text Retrieval for Open-Domain Question Answering
Kang, Deokhyung, Jung, Baikjin, Kim, Yunsu, Lee, Gary Geunbae
In table-text open-domain question answering, a retriever system retrieves relevant evidence from tables and text to answer questions. Previous studies in table-text open-domain question answering have two common challenges: firstly, their retrievers can be affected by false-positive labels in training datasets; secondly, they may struggle to provide appropriate evidence for questions that require reasoning across the table. To address these issues, we propose Denoised Table-Text Retriever (DoTTeR). Our approach involves utilizing a denoised training dataset with fewer false positive labels by discarding instances with lower question-relevance scores measured through a false positive detection model. Subsequently, we integrate table-level ranking information into the retriever to assist in finding evidence for questions that demand reasoning across the table. To encode this ranking information, we fine-tune a rank-aware column encoder to identify minimum and maximum values within a column. Experimental results demonstrate that DoTTeR significantly outperforms strong baselines on both retrieval recall and downstream QA tasks. Our code is available at https://github.com/deokhk/DoTTeR.
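The two ideas described above, discarding training instances with low question-relevance scores and exposing where a column's minimum and maximum lie, can be sketched in a few lines. This is not the DoTTeR code: the function names, the `relevance` key, the threshold value, and the use of plain index lookup (in place of a fine-tuned rank-aware encoder) are all assumptions for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def denoise(instances: List[Dict], threshold: float) -> List[Dict]:
    """Keep only instances whose relevance score reaches the threshold.

    Assumption: each instance carries a precomputed `relevance` score
    from some false-positive detection model.
    """
    return [inst for inst in instances if inst["relevance"] >= threshold]


def column_extremes(values: Sequence[float]) -> Tuple[int, int]:
    """Return the row indices of the minimum and maximum cell in a column.

    Such indices could serve as supervision targets for an encoder
    trained to recognise extreme values; here they are computed directly.
    """
    min_idx = min(range(len(values)), key=values.__getitem__)
    max_idx = max(range(len(values)), key=values.__getitem__)
    return min_idx, max_idx


# Placeholder data, not taken from any benchmark.
training_set = [
    {"question": "q1", "relevance": 0.9},
    {"question": "q2", "relevance": 0.2},
]
kept = denoise(training_set, threshold=0.5)
extremes = column_extremes([3.0, 1.0, 7.5, 2.0])
```

In this sketch the threshold trades dataset size against label noise: a higher value removes more suspected false positives but also more genuine training signal.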