Goto

Collaborating Authors

 Question Answering


Multilingual Non-Factoid Question Answering with Silver Answers

arXiv.org Artificial Intelligence

Most existing Question Answering Datasets (QuADs) primarily focus on factoid-based short-context Question Answering (QA) in high-resource languages. However, the scope of such datasets for low-resource languages remains limited, with only a few works centered on factoid-based QuADs and none on non-factoid QuADs. Therefore, this work presents MuNfQuAD, a multilingual QuAD with non-factoid questions. It utilizes interrogative sub-headings from BBC news articles as questions and the corresponding paragraphs as silver answers. The dataset comprises over 370K QA pairs across 38 languages, encompassing several low-resource languages, and stands as the largest multilingual QA dataset to date. Based on the manual annotations of 790 QA-pairs from MuNfQuAD (golden set), we observe that 98\% of questions can be answered using their corresponding silver answer. Our fine-tuned Answer Paragraph Selection (APS) model outperforms the baselines. The APS model attained an accuracy of 80\% and 72\%, as well as a macro F1 of 72\% and 66\%, on the MuNfQuAD testset and the golden set, respectively. Furthermore, the APS model effectively generalizes certain a language within the golden set, even after being fine-tuned on silver labels.


Risks and NLP Design: A Case Study on Procedural Document QA

arXiv.org Artificial Intelligence

As NLP systems are increasingly deployed at scale, concerns about their potential negative impacts have attracted the attention of the research community, yet discussions of risk have mostly been at an abstract level and focused on generic AI or NLP applications. We argue that clearer assessments of risks and harms to users--and concrete strategies to mitigate them--will be possible when we specialize the analysis to more concrete applications and their plausible users. As an illustration, this paper is grounded in cooking recipe procedural document question answering (ProcDocQA), where there are well-defined risks to users such as injuries or allergic reactions. Our case study shows that an existing language model, applied in "zero-shot" mode, quantitatively answers real-world questions about recipes as well or better than the humans who have answered the questions on the web. Using a novel questionnaire informed by theoretical work on AI risk, we conduct a risk-oriented error analysis that could then inform the design of a future system to be deployed with lower risk of harm and better performance.


RealMedQA: A pilot biomedical question answering dataset containing realistic clinical questions

arXiv.org Artificial Intelligence

Clinical question answering systems have the potential to provide clinicians with relevant and timely answers to their questions. Nonetheless, despite the advances that have been made, adoption of these systems in clinical settings has been slow. One issue is a lack of question-answering datasets which reflect the real-world needs of health professionals. In this work, we present RealMedQA, a dataset of realistic clinical questions generated by humans and an LLM. We describe the process for generating and verifying the QA pairs and assess several QA models on BioASQ and RealMedQA to assess the relative difficulty of matching answers to questions. We show that the LLM is more cost-efficient for generating "ideal" QA pairs. Additionally, we achieve a lower lexical similarity between questions and answers than BioASQ which provides an additional challenge to the top two QA models, as per the results. Introduction Clinical question answering (QA) systems could allow clinicians to find timely and relevant answers to questions occurring during consultations in real-time [1, 2, 3, 4, 5].


IIU: Independent Inference Units for Knowledge-based Visual Question Answering

arXiv.org Artificial Intelligence

Knowledge-based visual question answering requires external knowledge beyond visible content to answer the question correctly. One limitation of existing methods is that they focus more on modeling the inter-modal and intra-modal correlations, which entangles complex multimodal clues by implicit embeddings and lacks interpretability and generalization ability. The key challenge to solve the above problem is to separate the information and process it separately at the functional level. By reusing each processing unit, the generalization ability of the model to deal with different data can be increased. In this paper, we propose Independent Inference Units (IIU) for fine-grained multi-modal reasoning to decompose intra-modal information by the functionally independent units. Specifically, IIU processes each semantic-specific intra-modal clue by an independent inference unit, which also collects complementary information by communication from different units. To further reduce the impact of redundant information, we propose a memory update module to maintain semantic-relevant memory along with the reasoning process gradually. In comparison with existing non-pretrained multi-modal reasoning models on standard datasets, our model achieves a new state-of-the-art, enhancing performance by 3%, and surpassing basic pretrained multi-modal models. The experimental results show that our IIU model is effective in disentangling intra-modal clues as well as reasoning units to provide explainable reasoning evidence. Our code is available at https://github.com/Lilidamowang/IIU.


Enhancing Visual Question Answering through Ranking-Based Hybrid Training and Multimodal Fusion

arXiv.org Artificial Intelligence

Visual Question Answering (VQA) is a challenging task that requires systems to provide accurate answers to questions based on image content. Current VQA models struggle with complex questions due to limitations in capturing and integrating multimodal information effectively. To address these challenges, we propose the Rank VQA model, which leverages a ranking-inspired hybrid training strategy to enhance VQA performance. The Rank VQA model integrates high-quality visual features extracted using the Faster R-CNN model and rich semantic text features obtained from a pre-trained BERT model. These features are fused through a sophisticated multimodal fusion technique employing multi-head self-attention mechanisms. Additionally, a ranking learning module is incorporated to optimize the relative ranking of answers, thus improving answer accuracy. The hybrid training strategy combines classification and ranking losses, enhancing the model's generalization ability and robustness across diverse datasets. Experimental results demonstrate the effectiveness of the Rank VQA model. Our model significantly outperforms existing state-of-the-art models on standard VQA datasets, including VQA v2.0 and COCO-QA, in terms of both accuracy and Mean Reciprocal Rank (MRR). The superior performance of Rank VQA is evident in its ability to handle complex questions that require understanding nuanced details and making sophisticated inferences from the image and text. This work highlights the effectiveness of a ranking-based hybrid training strategy in improving VQA performance and lays the groundwork for further research in multimodal learning methods.


FastFiD: Improve Inference Efficiency of Open Domain Question Answering via Sentence Selection

arXiv.org Artificial Intelligence

Open Domain Question Answering (ODQA) has been advancing rapidly in recent times, driven by significant developments in dense passage retrieval and pretrained language models. Current models typically incorporate the FiD framework, which is composed by a neural retriever alongside an encoder-decoder neural reader. In the answer generation process, the retriever will retrieve numerous passages (around 100 for instance), each of which is then individually encoded by the encoder. Subsequently, the decoder makes predictions based on these encoded passages. Nevertheless, this framework can be relatively time-consuming, particularly due to the extensive length of the gathered passages. To address this, we introduce FastFiD in this paper, a novel approach that executes sentence selection on the encoded passages. This aids in retaining valuable sentences while reducing the context length required for generating answers. Experiments on three commonly used datasets (Natural Questions, TriviaQA and ASQA) demonstrate that our method can enhance the inference speed by 2.3X-5.7X, while simultaneously maintaining the model's performance. Moreover, an in-depth analysis of the model's attention reveals that the selected sentences indeed hold a substantial contribution towards the final answer. The codes are publicly available at https://github.com/thunlp/FastFiD.


Reference-free Hallucination Detection for Large Vision-Language Models

arXiv.org Artificial Intelligence

Large vision-language models (LVLMs) have made significant progress in recent years. While LVLMs exhibit excellent ability in language understanding, question answering, and conversations of visual inputs, they are prone to producing hallucinations. While several methods are proposed to evaluate the hallucinations in LVLMs, most are reference-based and depend on external tools, which complicates their practical application. To assess the viability of alternative methods, it is critical to understand whether the reference-free approaches, which do not rely on any external tools, can efficiently detect hallucinations. Therefore, we initiate an exploratory study to demonstrate the effectiveness of different reference-free solutions in detecting hallucinations in LVLMs. In particular, we conduct an extensive study on three kinds of techniques: uncertainty-based, consistency-based, and supervised uncertainty quantification methods on four representative LVLMs across two different tasks. The empirical results show that the reference-free approaches are capable of effectively detecting non-factual responses in LVLMs, with the supervised uncertainty quantification method outperforming the others, achieving the best performance across different settings.


Advancing Multimodal Large Language Models in Chart Question Answering with Visualization-Referenced Instruction Tuning

arXiv.org Artificial Intelligence

Emerging multimodal large language models (MLLMs) exhibit great potential for chart question answering (CQA). Recent efforts primarily focus on scaling up training datasets (i.e., charts, data tables, and question-answer (QA) pairs) through data collection and synthesis. However, our empirical study on existing MLLMs and CQA datasets reveals notable gaps. First, current data collection and synthesis focus on data volume and lack consideration of fine-grained visual encodings and QA tasks, resulting in unbalanced data distribution divergent from practical CQA scenarios. Second, existing work follows the training recipe of the base MLLMs initially designed for natural images, under-exploring the adaptation to unique chart characteristics, such as rich text elements. To fill the gap, we propose a visualization-referenced instruction tuning approach to guide the training dataset enhancement and model development. Specifically, we propose a novel data engine to effectively filter diverse and high-quality data from existing datasets and subsequently refine and augment the data using LLM-based generation techniques to better align with practical QA tasks and visual encodings. Then, to facilitate the adaptation to chart characteristics, we utilize the enriched data to train an MLLM by unfreezing the vision encoder and incorporating a mixture-of-resolution adaptation strategy for enhanced fine-grained recognition. Experimental results validate the effectiveness of our approach. Even with fewer training examples, our model consistently outperforms state-of-the-art CQA models on established benchmarks. We also contribute a dataset split as a benchmark for future research. Source codes and datasets of this paper are available at https://github.com/zengxingchen/ChartQA-MLLM.


Chain of Condition: Construct, Verify and Solve Conditions for Conditional Question Answering

arXiv.org Artificial Intelligence

Conditional question answering (CQA) is an important task that aims to find probable answers and identify conditions that need to be satisfied to support the answer. Existing approaches struggle with CQA due to two main challenges: (1) precisely identifying conditions and their logical relationship, and (2) verifying and solving the conditions. To address these challenges, we propose Chain of Condition, a novel prompting approach by firstly identifying all conditions and constructing their logical relationships explicitly according to the document, then verifying whether these conditions are satisfied, finally solving the logical expression by tools to indicate any missing conditions and generating the answer based on the resolved conditions. The experiments on two benchmark conditional question answering datasets shows chain of condition outperforms existing prompting baselines, establishing a new state-of-the-art. Furthermore, with backbone models like GPT-3.5-Turbo or GPT-4, it surpasses all supervised baselines with only few-shot settings.


MSG-Chart: Multimodal Scene Graph for ChartQA

arXiv.org Artificial Intelligence

Automatic Chart Question Answering (ChartQA) is challenging due to the complex distribution of chart elements with patterns of the underlying data not explicitly displayed in charts. To address this challenge, we design a joint multimodal scene graph for charts to explicitly represent the relationships between chart elements and their patterns. Our proposed multimodal scene graph includes a visual graph and a textual graph to jointly capture the structural and semantical knowledge from the chart. This graph Figure 1: Cutting-Edge LLMs and Our MSG-Chart module can be easily integrated with different vision transformers as inductive bias. Our experiments demonstrate that incorporating charts often include extensive text and numerical data, understanding the proposed graph module enhances the understanding of charts' these features is crucial for accurate question answering. While elements' structure and semantics, thereby improving performance recognizing the underlying text of an object is enough for data extraction on publicly available benchmarks, ChartQA and OpenCQA.