Question Answering
What You See is What You Read? Improving Text-Image Alignment Evaluation
Automatically determining whether a text and a corresponding image are semantically aligned is a significant challenge for vision-language models, with applications in generative text-to-image and image-to-text tasks. In this work, we study methods for automatic text-image alignment evaluation. We first introduce SeeTRUE: a comprehensive evaluation set, spanning multiple datasets from both text-to-image and image-to-text generation tasks, with human judgements for whether a given text-image pair is semantically aligned. We then describe two automatic methods to determine alignment: the first involving a pipeline based on question generation and visual question answering models, and the second employing an end-to-end classification approach by finetuning multimodal pretrained models. Both methods surpass prior approaches in various text-image alignment tasks, with significant improvements in challenging cases that involve complex composition or unnatural images.
Localizing Factual Inconsistencies in Attributable Text Generation
Cattan, Arie, Roit, Paul, Zhang, Shiyue, Wan, David, Aharoni, Roee, Szpektor, Idan, Bansal, Mohit, Dagan, Ido
There has been an increasing interest in detecting hallucinations in model-generated texts, both manually and automatically, at varying levels of granularity. However, most existing methods fail to precisely pinpoint the errors. In this work, we introduce QASemConsistency, a new formalism for localizing factual inconsistencies in attributable text generation, at a fine-grained level. Drawing inspiration from Neo-Davidsonian formal semantics, we propose decomposing the generated text into minimal predicate-argument level propositions, expressed as simple question-answer (QA) pairs, and assess whether each individual QA pair is supported by a trusted reference text. As each QA pair corresponds to a single semantic relation between a predicate and an argument, QASemConsistency effectively localizes the unsupported information. We first demonstrate the effectiveness of the QASemConsistency methodology for human annotation, by collecting crowdsourced annotations of granular consistency errors, while achieving a substantial inter-annotator agreement ($\kappa > 0.7$). Then, we implement several methods for automatically detecting localized factual inconsistencies, with both supervised entailment models and open-source LLMs.
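The agreement figure above is a kappa statistic. The abstract does not say which variant was used, so the following is only a minimal sketch assuming Cohen's kappa between two annotators who each label every QA pair as supported (1) or unsupported (0). The function name and the toy labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items (assumed variant)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies, summed over labels.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy example: four QA pairs, the annotators disagree on the last one.
kappa = cohens_kappa([1, 1, 0, 0], [1, 1, 0, 1])
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is usually reported instead of plain percent agreement.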
Reviews: Learning Conditioned Graph Structures for Interpretable Visual Question Answering
The predicted graph connectivity (at least in these few examples) looks quite intuitive and interpretable, even when the model predicts the incorrect answer. Weaknesses -- Figure 2 caption says "[insert quick recap here]":) -- The paper emphasizes multiple times that the proposed approach achieves state of the art accuracies on VQA v2, but that does not seem to be the case. The best published result so far -- the counting module by Zhang et al., ICLR 2018 -- performs 3% better than the proposed approach (as shown in Table 1 as well). This claim needs to be sufficiently toned down. Also, the proposed approach is marginally better than the base Bottom-Up architecture.
Reviews: Multimodal Learning and Reasoning for Visual Question Answering
The paper introduces a novel modular neural network for multimodal tasks such as Visual Question Answering. The paper argues that a single visual representation is not sufficient for VQA and using some task specific visual features such as scene classification or object detection would result in a better VQA model. Following this motivation, the paper proposes a VQA model with modules tailored for specific tasks -- scene classification, object detection/classification, face detection/analysis -- and pushes the state-of-the-art performance. Strengths -- Since VQA spans many lower level vision tasks such as object detection, scene classification, etc., it makes a lot of sense that the visual features tailored for these tasks should help for the task of VQA. To my knowledge, this is the first paper which explicitly uses this information in building their model, and shows the importance of visual features from each task in their ablation studies.
Reviews: Dialog-to-Action: Conversational Question Answering Over a Large-Scale Knowledge Base
This paper proposes a semantic parsing method for dialog-based QA over a large-scale knowledge base. The method significantly outperforms the existing state of the art on CSQA, a recently-released conversational QA dataset. One of the major novelties of this paper is breaking apart the logical forms in the dialog history into smaller subsequences, any of which can be copied over into the logical form for the current question. While I do have some concerns with the method and the writing (detailed below), overall I liked this paper and I think that some of the ideas within it could be useful more broadly for QA researchers. Detailed comments: - I found many parts of the paper to be confusing, requiring multiple reads to fully understand.
Reviews: Out of the Box: Reasoning with Graph Convolution Nets for Factual Visual Question Answering
This ignores the inherent graph structure of the knowledge base, and performs reasoning from facts to answer one at a time, which is computationally inefficient. Two entities have a connecting edge if they belong to the same fact. Strengths -- The proposed approach is intuitive, sufficiently novel, and outperforms prior work by a large margin -- 10% better than the previous best approach, which is an impressive result. Weaknesses -- Given that the fact retrieval step is still the bottleneck in terms of accuracy (Table 4), it would be useful to check how sensitive downstream accuracy is to the choice of retrieving 100 facts. What is the answering accuracy if 50 facts are retrieved?
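The edge rule mentioned in this review (two entities are connected if they belong to the same fact) can be sketched as follows. This is an illustration only: it assumes each fact is a (subject, relation, object) triple, and the function name and example facts are hypothetical rather than taken from the reviewed paper.

```python
from collections import defaultdict


def build_entity_graph(facts):
    """Build an undirected adjacency map from (subject, relation, object) triples."""
    adjacency = defaultdict(set)
    for subject, _relation, obj in facts:
        # Entities that appear in the same fact get a connecting edge.
        adjacency[subject].add(obj)
        adjacency[obj].add(subject)
    # Sort neighbors so the output is deterministic.
    return {entity: sorted(neighbors) for entity, neighbors in adjacency.items()}


# Toy example with two hypothetical facts sharing the entity "cat".
graph = build_entity_graph([
    ("cat", "IsA", "animal"),
    ("cat", "CapableOf", "climbing"),
])
```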
Do great minds think alike? Investigating Human-AI Complementarity in Question Answering with CAIMIRA
Gor, Maharshi, Daumé, Hal III, Zhou, Tianyi, Boyd-Graber, Jordan
Recent advances in large language models (LLMs) have led to claims of AI surpassing humans in natural language processing (NLP) tasks such as textual understanding and reasoning. This work investigates these assertions by introducing CAIMIRA, a novel framework rooted in item response theory (IRT) that enables quantitative assessment and comparison of problem-solving abilities of question-answering (QA) agents: humans and AI systems. Through analysis of over 300,000 responses from ~70 AI systems and 155 humans across thousands of quiz questions, CAIMIRA uncovers distinct proficiency patterns in knowledge domains and reasoning skills. Humans outperform AI systems in knowledge-grounded abductive and conceptual reasoning, while state-of-the-art LLMs like GPT-4 and LLaMA show superior performance on targeted information retrieval and fact-based reasoning, particularly when information gaps are well-defined and addressable through pattern matching or data retrieval. These findings highlight the need for future QA tasks to focus on questions that challenge not only higher-order reasoning and scientific thinking, but also demand nuanced linguistic interpretation and cross-contextual knowledge application, helping advance AI developments that better emulate or complement human cognitive abilities in real-world problem-solving.
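CAIMIRA is described as rooted in item response theory. As a rough illustration of the underlying idea, the sketch below uses the textbook two-parameter logistic (2PL) IRT model, which is an assumption here and not CAIMIRA's own formulation. The function and parameter names are hypothetical.

```python
import math


def p_correct_2pl(ability, difficulty, discrimination=1.0):
    """Probability that an agent answers an item correctly under the standard 2PL model."""
    # Logistic function of the scaled gap between agent ability and item difficulty.
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))


# An agent whose ability equals the item difficulty has a 50% chance of success.
p_even = p_correct_2pl(ability=0.0, difficulty=0.0)
# A stronger agent has a higher chance on the same item.
p_strong = p_correct_2pl(ability=1.0, difficulty=0.0)
```

In this kind of model, agents (humans or AI systems) and questions are placed on a shared scale, which is what makes their abilities comparable across different question sets.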
PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling
Xie, Xudong, Yin, Liang, Yan, Hao, Liu, Yang, Ding, Jing, Liao, Minghui, Liu, Yuliang, Chen, Wei, Bai, Xiang
Document understanding is a challenging task that requires processing and comprehending large amounts of textual and visual information. Recent advances in Large Language Models (LLMs) have significantly improved performance on this task. However, existing methods typically focus on either plain text or a limited number of document images, struggling to handle long PDF documents with interleaved text and images, especially in academic papers. In this paper, we introduce PDF-WuKong, a multimodal large language model (MLLM) which is designed to enhance multimodal question-answering (QA) for long PDF documents. PDF-WuKong incorporates a sparse sampler that operates on both text and image representations, significantly improving the efficiency and capability of the MLLM. The sparse sampler is integrated with the MLLM's image encoder and selects the paragraphs or diagrams most pertinent to user queries for processing by the language model. To effectively train and evaluate our model, we construct PaperPDF, a dataset consisting of a broad collection of academic papers sourced from arXiv, and propose multiple strategies to automatically generate 1M QA pairs along with their corresponding evidence sources. Experimental results demonstrate the superiority and high efficiency of our approach over other models on the task of long multimodal PDF understanding, surpassing proprietary products by an average of 8.6% on F1. Our code and dataset will be released at https://github.com/yh-hust/PDF-Wukong.
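The selection step described above (keeping only the paragraphs or diagrams most pertinent to a query) can be illustrated with a simple top-k similarity filter. This is a sketch under stated assumptions, not PDF-WuKong's sampler, which the abstract describes as trained end to end: it assumes the query and each chunk already have embedding vectors, and all names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_top_k(query_vec, chunk_vecs, k):
    """Return indices of the k chunks most similar to the query, in document order."""
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    # Restore document order so the selected chunks stay in reading sequence.
    return sorted(ranked[:k])


# Toy example: three chunk embeddings, keep the two closest to the query.
kept = select_top_k([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]], k=2)
```

Passing only the selected chunks to the language model is what keeps the input short for long documents; the quality of the result depends entirely on how well the embeddings capture relevance.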
CasiMedicos-Arg: A Medical Question Answering Dataset Annotated with Explanatory Argumentative Structures
Sviridova, Ekaterina, Yeginbergen, Anar, Estarrona, Ainara, Cabrio, Elena, Villata, Serena, Agerri, Rodrigo
Explaining Artificial Intelligence (AI) decisions is a major challenge in AI today, in particular when applied to sensitive scenarios like medicine and law. However, the need to explain the rationale behind decisions is also a central issue for human-based deliberation, as it is important to justify why a certain decision has been taken. Resident medical doctors, for instance, are required not only to provide a (possibly correct) diagnosis, but also to explain how they reached a certain conclusion. Developing new tools to help residents train their explanation skills is therefore a central objective of AI in education. In this paper, we follow this direction, and we present, to the best of our knowledge, the first multilingual dataset for Medical Question Answering where correct and incorrect diagnoses for a clinical case are enriched with a natural language explanation written by doctors. These explanations have been manually annotated with argument components (i.e., premise, claim) and argument relations (i.e., attack, support), resulting in the Multilingual CasiMedicos-Arg dataset, which consists of 558 clinical cases in four languages (English, Spanish, French, Italian) with explanations, where we annotated 5021 claims, 2313 premises, 2431 support relations, and 1106 attack relations. We conclude by showing how competitive baselines perform over this challenging dataset for the argument mining task.