Question Answering
Consecutive Question Generation via Dynamic Multitask Learning
Li, Yunji, Li, Sujian, Shi, Xing
In this paper, we propose the task of consecutive question generation (CQG), which generates a set of logically related question-answer pairs to understand a whole passage, with a comprehensive consideration of the aspects including accuracy, coverage, and informativeness. To achieve this, we first examine the four key elements of CQG, i.e., question, answer, rationale, and context history, and propose a novel dynamic multitask framework with one main task generating a question-answer pair, and four auxiliary tasks generating other elements. It directly helps the model generate good questions through both joint training and self-reranking. At the same time, to fully explore the worth-asking information in a given passage, we make use of the reranking losses to sample the rationales and search for the best question series globally. Finally, we measure our strategy by QA data augmentation and manual evaluation, as well as a novel application of generated question-answer pairs on DocNLI. We prove that our strategy can improve question generation significantly and benefit multiple related NLP tasks.
MapQA: A Dataset for Question Answering on Choropleth Maps
Chang, Shuaichen, Palzer, David, Li, Jialin, Fosler-Lussier, Eric, Xiao, Ningchuan
Choropleth maps are a common visual representation for region-specific tabular data and are used in a number of different venues (newspapers, articles, etc). These maps are human-readable but are often challenging to deal with when trying to extract data for screen readers, analyses, or other related tasks. Recent research into Visual-Question Answering (VQA) has studied question answering on human-generated charts (ChartQA), such as bar, line, and pie charts. However, little work has paid attention to understanding maps; general VQA models, and ChartQA models, suffer when asked to perform this task. To facilitate and encourage research in this area, we present MapQA, a large-scale dataset of ~800K question-answer pairs over ~60K map images. Our task tests various levels of map understanding, from surface questions about map styles to complex questions that require reasoning on the underlying data. We present the unique challenges of MapQA that frustrate most strong baseline algorithms designed for ChartQA and general VQA tasks. We also present a novel algorithm, Visual Multi-Output Data Extraction based QA (V-MODEQA) for MapQA. V-MODEQA extracts the underlying structured data from a map image with a multi-output model and then performs reasoning on the extracted data. Our experimental results show that V-MODEQA has better overall performance and robustness on MapQA than the state-of-the-art ChartQA and VQA algorithms by capturing the unique properties in map question answering.
Cheater's Bowl: Human vs. Computer Search Strategies for Open-Domain Question Answering
He, Wanrong, Mao, Andrew, Boyd-Graber, Jordan
For humans and computers, the first step in answering an open-domain question is retrieving a set of relevant documents from a large corpus. However, the strategies that computers use fundamentally differ from those of humans. To better understand these differences, we design a gamified interface for data collection -- Cheater's Bowl -- where a human answers complex questions with access to both traditional and modern search tools. We collect a dataset of human search sessions, analyze human search strategies, and compare them to state-of-the-art multi-hop QA models. Humans query logically, apply dynamic search chains, and use world knowledge to boost searching. We demonstrate how human queries can improve the accuracy of existing systems and propose improving the future design of QA models.
Reasoning Circuits: Few-shot Multihop Question Generation with Structured Rationales
Kulshreshtha, Saurabh, Rumshisky, Anna
Multi-hop Question Generation is the task of generating questions which require the reader to reason over and combine information spread across multiple passages using several reasoning steps. Chain-of-thought rationale generation has been shown to improve performance on multi-step reasoning tasks and make model predictions more interpretable. However, few-shot performance gains from including rationales have been largely observed only in +100B language models, and otherwise require large scale manual rationale annotation. In this work, we introduce a new framework for applying chain-of-thought inspired structured rationale generation to multi-hop question generation under a very low supervision regime (8- to 128-shot). We propose to annotate a small number of examples following our proposed multi-step rationale schema, treating each reasoning step as a separate task to be performed by a generative language model. We show that our framework leads to improved control over the difficulty of the generated questions and better performance compared to baselines trained without rationales, both on automatic evaluation metrics and in human evaluation. Importantly, we show that this is achievable with a modest model size.
A Comparative Study of Question Answering over Knowledge Bases
Tran, Khiem Vinh, Phan, Hao Phu, Quach, Khang Nguyen Duc, Nguyen, Ngan Luu-Thuy, Jo, Jun, Nguyen, Thanh Tam
Question answering over knowledge bases (KBQA) has become a popular approach to help users extract information from knowledge bases. Although several systems exist, choosing one suitable for a particular application scenario is difficult. In this article, we provide a comparative study of six representative KBQA systems on eight benchmark datasets. In that, we study various question types, properties, languages, and domains to provide insights on where existing systems struggle. On top of that, we propose an advanced mapping algorithm to aid existing models in achieving superior results. Moreover, we also develop a multilingual corpus COVID-KGQA, which encourages COVID-19 research and multilingualism for the diversity of future AI. Finally, we discuss the key findings and their implications as well as performance guidelines and some future improvements. Our source code is available at \url{https://github.com/tamlhp/kbqa}.
Generative Long-form Question Answering: Relevance, Faithfulness and Succinctness
In this thesis, we investigated the relevance, faithfulness, and succinctness aspects of Long Form Question Answering (LFQA). LFQA aims to generate an in-depth, paragraph-length answer for a given question, to help bridge the gap between real scenarios and the existing open-domain QA models which can only extract short-span answers. LFQA is quite challenging and under-explored. Few works have been done to build an effective LFQA system. It is even more challenging to generate a good-quality long-form answer relevant to the query and faithful to facts, since a considerable amount of redundant, complementary, or contradictory information will be contained in the retrieved documents. Moreover, no prior work has been investigated to generate succinct answers. We are among the first to research the LFQA task. We pioneered the research direction to improve the answer quality in terms of 1) query-relevance, 2) answer faithfulness, and 3) answer succinctness.
Exploring Dual Encoder Architectures for Question Answering
Dong, Zhe, Ni, Jianmo, Bikel, Daniel M., Alfonseca, Enrique, Wang, Yuan, Qu, Chen, Zitouni, Imed
Dual encoders have been used for question-answering (QA) and information retrieval (IR) tasks with good results. Previous research focuses on two major types of dual encoders, Siamese Dual Encoder (SDE), with parameters shared across two encoders, and Asymmetric Dual Encoder (ADE), with two distinctly parameterized encoders. In this work, we explore different ways in which the dual encoder can be structured, and evaluate how these differences can affect their efficacy in terms of QA retrieval tasks. By evaluating on MS MARCO, open domain NQ and the MultiReQA benchmarks, we show that SDE performs significantly better than ADE. We further propose three different improved versions of ADEs by sharing or freezing parts of the architectures between two encoder towers. We find that sharing parameters in projection layers would enable ADEs to perform competitively with or outperform SDEs. We further explore and explain why parameter sharing in projection layer significantly improves the efficacy of the dual encoders, by directly probing the embedding spaces of the two encoder towers with t-SNE algorithm.
Week 1-- FlashCards
Hi, we are a two-student group that will be trying to create an ML model for their AIN311 course. This is the first of the many blog posts we will publish regarding this project. Stay tuned for a new post every Sunday. As everybody knows AI changes our lives for the better day by day. As two AI Engineering students, we thought we could kill two birds with one stone and create a project for our course which could help us study better while saving us time.
Towards Robust Numerical Question Answering: Diagnosing Numerical Capabilities of NLP Systems
Xu, Jialiang, Zhou, Mengyu, He, Xinyi, Han, Shi, Zhang, Dongmei
Numerical Question Answering is the task of answering questions that require numerical capabilities. Previous works introduce general adversarial attacks to Numerical Question Answering, while not systematically exploring numerical capabilities specific to the topic. In this paper, we propose to conduct numerical capability diagnosis on a series of Numerical Question Answering systems and datasets. A series of numerical capabilities are highlighted, and corresponding dataset perturbations are designed. Empirical results indicate that existing systems are severely challenged by these perturbations. E.g., Graph2Tree experienced a 53.83% absolute accuracy drop against the ``Extra'' perturbation on ASDiv-a, and BART experienced 13.80% accuracy drop against the ``Language'' perturbation on the numerical subset of DROP. As a counteracting approach, we also investigate the effectiveness of applying perturbations as data augmentation to relieve systems' lack of robust numerical capabilities. With experiment analysis and empirical studies, it is demonstrated that Numerical Question Answering with robust numerical capabilities is still to a large extent an open question. We discuss future directions of Numerical Question Answering and summarize guidelines on future dataset collection and system design.
Learning to Answer Multilingual and Code-Mixed Questions
Question-answering (QA) that comes naturally to humans is a critical component in seamless human-computer interaction. It has emerged as one of the most convenient and natural methods to interact with the web and is especially desirable in voice-controlled environments. Despite being one of the oldest research areas, the current QA system faces the critical challenge of handling multilingual queries. To build an Artificial Intelligent (AI) agent that can serve multilingual end users, a QA system is required to be language versatile and tailored to suit the multilingual environment. Recent advances in QA models have enabled surpassing human performance primarily due to the availability of a sizable amount of high-quality datasets. However, the majority of such annotated datasets are expensive to create and are only confined to the English language, making it challenging to acknowledge progress in foreign languages. Therefore, to measure a similar improvement in the multilingual QA system, it is necessary to invest in high-quality multilingual evaluation benchmarks. In this dissertation, we focus on advancing QA techniques for handling end-user queries in multilingual environments. This dissertation consists of two parts. In the first part, we explore multilingualism and a new dimension of multilingualism referred to as code-mixing. Second, we propose a technique to solve the task of multi-hop question generation by exploiting multiple documents. Experiments show our models achieve state-of-the-art performance on answer extraction, ranking, and generation tasks on multiple domains of MQA, VQA, and language generation. The proposed techniques are generic and can be widely used in various domains and languages to advance QA systems.