Question Answering
Diverse Visuo-Lingustic Question Answering (DVLQA) Challenge
Sampat, Shailaja, Yang, Yezhou, Baral, Chitta
Existing question answering datasets mostly contain homogeneous contexts, based on either textual or visual information alone. On the other hand, digitalization has evolved the nature of reading which often includes integrating information across multiple heterogeneous sources. To bridge the gap between two, we compile a Diverse Visuo-Lingustic Question Answering (DVLQA) challenge corpus, where the task is to derive joint inference about the given image-text modality in a question answering setting. Each dataset item consists of an image and a reading passage, where questions are designed to combine both visual and textual information, i.e. ignoring either of them would make the question unanswerable. We first explore the combination of best existing deep learning architectures for visual question answering and machine comprehension to solve DVLQA subsets and show that they are unable to reason well on the joint task. We then develop a modular method which demonstrates slightly better baseline performance and offers more transparency for interpretation of intermediate outputs. However, this is still far behind the human performance, therefore we believe DVLQA will be a challenging benchmark for question answering involving reasoning over visuo-linguistic context. The dataset, code and public leaderboard will be made available at https://github.com/shailaja183/DVLQA.
Robust Question Answering Through Sub-part Alignment
Current textual question answering models achieve strong performance on in-domain test sets, but often do so by fitting surface-level patterns in the data, so they fail to generalize to out-of-distribution settings. To make a more robust and understandable QA system, we model question answering as an alignment problem. We decompose both the question and context into smaller units based on off-the-shelf semantic representations (here, semantic roles), and align the question to a subgraph of the context in order to find the answer. We formulate our model as a structured SVM, with alignment scores computed via BERT, and we can train end-to-end despite using beam search for approximate inference. Our explicit use of alignments allows us to explore a set of constraints with which we can prohibit certain types of bad model behavior arising in cross-domain settings. Furthermore, by investigating differences in scores across different potential answers, we can seek to understand what particular aspects of the input lead the model to choose the answer without relying on post-hoc explanation techniques. We train our model on SQuAD v1.1 and test it on several adversarial and out-of-domain datasets. The results show that our model is more robust cross-domain than the standard BERT QA model, and constraints derived from alignment scores allow us to effectively trade off coverage and accuracy.
Challenge Closed-book Science Exam: A Meta-learning Based Question Answering System
Zheng, Xinyue, Wang, Peng, Wang, Qigang, Shi, Zhongchao
Prior work in standardized science exams requires support from large text corpus, such as targeted science corpus from Wikipedia or SimpleWikipedia. However, retrieving knowledge from the large corpus is time-consuming and questions embedded in complex semantic representation may interfere with retrieval. Inspired by the dual process theory in cognitive science, we propose a MetaQA framework, where system 1 is an intuitive meta-classifier and system 2 is a reasoning module. Specifically, our method based on meta-learning method and large language model BERT, which can efficiently solve science problems by learning from related example questions without relying on external knowledge bases. We evaluate our method on AI2 Reasoning Challenge (ARC), and the experimental results show that meta-classifier yields considerable classification performance on emerging question types. The information provided by meta-classifier significantly improves the accuracy of reasoning module from 46.6% to 64.2%, which has a competitive advantage over retrieval-based QA methods.
Event-QA: A Dataset for Event-Centric Question Answering over Knowledge Graphs
Costa, Tarcísio Souza, Gottschalk, Simon, Demidova, Elena
Semantic Question Answering (QA) is the key technology to facilitate intuitive user access to semantic information stored in knowledge graphs. Whereas most of the existing QA systems and datasets focus on entity-centric questions, very little is known about the performance of these systems in the context of events. As new event-centric knowledge graphs emerge, datasets for such questions gain importance. In this paper we present the Event-QA dataset for answering event-centric questions over knowledge graphs. Event-QA contains 1000 semantic queries and the corresponding English, German and Portuguese verbalisations for EventKG - a recently proposed event-centric knowledge graph with over 970 thousand events.
Rapidly Bootstrapping a Question Answering Dataset for COVID-19
Tang, Raphael, Nogueira, Rodrigo, Zhang, Edwin, Gupta, Nikhil, Cam, Phuong, Cho, Kyunghyun, Lin, Jimmy
We present CovidQA, the beginnings of a question answering dataset specifically designed for COVID-19, built by hand from knowledge gathered from Kaggle's COVID-19 Open Research Dataset Challenge. To our knowledge, this is the first publicly available resource of its type, and intended as a stopgap measure for guiding research until more substantial evaluation resources become available. While this dataset, comprising 124 question-article pairs as of the present version 0.1 release, does not have sufficient examples for supervised machine learning, we believe that it can be helpful for evaluating the zero-shot or transfer capabilities of existing models on topics specifically related to COVID-19. This paper describes our methodology for constructing the dataset and presents the effectiveness of a number of baselines, including term-based techniques and various transformer-based models. The dataset is available at http://covidqa.ai/
HybridQA: A Dataset of Multi-Hop Question Answering over Tabular and Textual Data
Chen, Wenhu, Zha, Hanwen, Chen, Zhiyu, Xiong, Wenhan, Wang, Hong, Wang, William
Existing question answering datasets focus on dealing with homogeneous information, based either only on text or KB/Table information alone. However, as human knowledge is distributed over heterogeneous forms, using homogeneous information might lead to severe coverage problems. To fill in the gap, we present \dataset, a new large-scale question-answering dataset that requires reasoning on heterogeneous information. Each question is aligned with a structured Wikipedia table and multiple free-form corpora linked with the entities in the table. The questions are designed to aggregate both tabular information and text information, i.e. lack of either form would render the question unanswerable. We test with three different models: 1) table-only model. 2) text-only model. 3) a hybrid model \model which combines both table and textual information to build a reasoning path towards the answer. The experimental results show that the first two baselines obtain compromised scores below 20\%, while \model significantly boosts EM score to over 50\%, which proves the necessity to aggregate both structure and unstructured information in \dataset. However, \model's score is still far behind human performance, hence we believe \dataset to an ideal and challenging benchmark to study question answering under heterogeneous information. The dataset and code are available at \url{https://github.com/wenhuchen/HybridQA}.
Explaining Question Answering Models through Text Generation
Latcinnik, Veronica, Berant, Jonathan
Large pre-trained language models (LMs) have been shown to perform surprisingly well when fine-tuned on tasks that require commonsense and world knowledge. However, in end-to-end architectures, it is difficult to explain what is the knowledge in the LM that allows it to make a correct prediction. In this work, we propose a model for multi-choice question answering, where a LM-based generator generates a textual hypothesis that is later used by a classifier to answer the question. The hypothesis provides a window into the information used by the fine-tuned LM that can be inspected by humans. A key challenge in this setup is how to constrain the model to generate hypotheses that are meaningful to humans. We tackle this by (a) joint training with a simple similarity classifier that encourages meaningful hypotheses, and (b) by adding loss functions that encourage natural text without repetitions. We show on several tasks that our model reaches performance that is comparable to end-to-end architectures, while producing hypotheses that elucidate the knowledge used by the LM for answering the question.
Natural Perturbation for Robust Question Answering
Khashabi, Daniel, Khot, Tushar, Sabharwal, Ashish
While recent models have achieved human-level scores on many NLP datasets, we observe that they are considerably sensitive to small changes in input. As an alternative to the standard approach of addressing this issue by constructing training sets of completely new examples, we propose doing so via minimal perturbation of examples. Specifically, our approach involves first collecting a set of seed examples and then applying human-driven natural perturbations (as opposed to rule-based machine perturbations), which often change the gold label as well. Local perturbations have the advantage of being relatively easier (and hence cheaper) to create than writing out completely new examples. To evaluate the impact of this phenomenon, we consider a recent question-answering dataset (BoolQ) and study the benefit of our approach as a function of the perturbation cost ratio, the relative cost of perturbing an existing question vs. creating a new one from scratch. We find that when natural perturbations are moderately cheaper to create, it is more effective to train models using them: such models exhibit higher robustness and better generalization, while retaining performance on the original BoolQ dataset.
Variational Question-Answer Pair Generation for Machine Reading Comprehension
Shinoda, Kazutoshi, Aizawa, Akiko
We present a deep generative model of question-answer (QA) pairs for machine reading comprehension. We introduce two independent latent random variables into our model in order to diversify answers and questions separately. We also study the effect of explicitly controlling the KL term in the variational lower bound in order to avoid the "posterior collapse" issue, where the model ignores latent variables and generates QA pairs that are almost the same. Our experiments on SQuAD v1.1 showed that variational methods can aid QA pair modeling capacity, and that the controlled KL term can significantly improve diversity while generating high-quality questions and answers comparable to those of the existing systems.
Guessing What's Plausible But Remembering What's True: Accurate Neural Reasoning for Question-Answering
Sun, Haitian, Arnold, Andrew O., Bedrax-Weiss, Tania, Pereira, Fernando, Cohen, William W.
Neural approaches to natural language processing (NLP) often fail at the logical reasoning needed for deeper language understanding. In particular, neural approaches to reasoning that rely on embedded \emph{generalizations} of a knowledge base (KB) implicitly model which facts that are \emph{plausible}, but may not model which facts are \emph{true}, according to the KB. While generalizing the facts in a KB is useful for KB completion, the inability to distinguish between plausible inferences and logically entailed conclusions can be problematic in settings like as KB question answering (KBQA). We propose here a novel KB embedding scheme that supports generalization, but also allows accurate logical reasoning with a KB. Our approach introduces two new mechanisms for KB reasoning: neural retrieval over a set of embedded triples, and "memorization" of highly specific information with a compact sketch structure. Experimentally, this leads to substantial improvements over the state-of-the-art on two KBQA benchmarks.