Question Answering
Toward a Unified Framework for Unsupervised Complex Tabular Reasoning
Li, Zhenyu, Li, Xiuxing, Duan, Zhichao, Dong, Bowen, Liu, Ning, Wang, Jianyong
Structured tabular data exist across nearly all fields. Reasoning task over these data aims to answer questions or determine the truthiness of hypothesis sentences by understanding the semantic meaning of a table. While previous works have devoted significant efforts to the tabular reasoning task, they always assume there are sufficient labeled data. However, constructing reasoning samples over tables (and related text) is labor-intensive, especially when the reasoning process is complex. When labeled data is insufficient, the performance of models will suffer an unendurable decline. In this paper, we propose a unified framework for unsupervised complex tabular reasoning (UCTR), which generates sufficient and diverse synthetic data with complex logic for tabular reasoning tasks, assuming no human-annotated data at all. We first utilize a random sampling strategy to collect diverse programs of different types and execute them on tables based on a "Program-Executor" module. To bridge the gap between the programs and natural language sentences, we design a powerful "NL-Generator" module to generate natural language sentences with complex logic from these programs. Since a table often occurs with its surrounding texts, we further propose novel "Table-to-Text" and "Text-to-Table" operators to handle joint table-text reasoning scenarios. This way, we can adequately exploit the unlabeled table resources to obtain a well-performed reasoning model under an unsupervised setting. Our experiments cover different tasks (question answering and fact verification) and different domains (general and specific), showing that our unsupervised methods can achieve at most 93% performance compared to supervised models. We also find that it can substantially boost the supervised performance in low-resourced domains as a data augmentation technique. Our code is available at https://github.com/leezythu/UCTR.
To Adapt or to Annotate: Challenges and Interventions for Domain Adaptation in Open-Domain Question Answering
Dua, Dheeru, Strubell, Emma, Singh, Sameer, Verga, Pat
Recent advances in open-domain question answering (ODQA) have demonstrated impressive accuracy on standard Wikipedia style benchmarks. However, it is less clear how robust these models are and how well they perform when applied to real-world applications in drastically different domains. While there has been some work investigating how well ODQA models perform when tested for out-of-domain (OOD) generalization, these studies have been conducted only under conservative shifts in data distribution and typically focus on a single component (ie. retrieval) rather than an end-to-end system. In response, we propose a more realistic and challenging domain shift evaluation setting and, through extensive experiments, study end-to-end model performance. We find that not only do models fail to generalize, but high retrieval scores often still yield poor answer prediction accuracy. We then categorize different types of shifts and propose techniques that, when presented with a new dataset, predict if intervention methods are likely to be successful. Finally, using insights from this analysis, we propose and evaluate several intervention methods which improve end-to-end answer F1 score by up to 24 points.
Bridging The Gap: Entailment Fused-T5 for Open-retrieval Conversational Machine Reading Comprehension
Zhang, Xiao, Huang, Heyan, Chi, Zewen, Mao, Xian-Ling
Open-retrieval conversational machine reading comprehension (OCMRC) simulates reallife conversational interaction scenes. Machines are required to make a decision of Yes/No/Inquire or generate a follow-up question when the decision is Inquire based on retrieved rule texts, user scenario, user question, and dialogue history. Recent studies explored the methods to reduce the information gap between decision-making and question generation and thus improve the performance of generation. However, the information gap still exists because these pipeline structures are still limited in decision-making, span extraction, and question rephrasing three stages. Decision-making and generation are reasoning separately, and the entailment reasoning utilized in decision-making is hard to share through all stages. To tackle the above problem, we proposed a novel one-stage endto-end framework, called Entailment Fused-Figure 1: An example in the OCMRC dataset. Given T5 (EFT), to bridge the information gap between the user scenario and user question, machines are decision-making and generation in a required to first retrieve related rule texts in the global understanding manner. The extensive knowledge database, and then make a decision of experimental results demonstrate that our proposed Yes/No/Inquire or generate a follow-up question framework achieves new state-of-the-art when the decision is Inquire based on retrieved rule performance on the OR-ShARC benchmark.
Source-Free Domain Adaptation for Question Answering with Masked Self-training
Yin, M., Wang, B., Dong, Y., Ling, C.
Most previous unsupervised domain adaptation (UDA) methods for question answering(QA) require access to source domain data while fine-tuning the model for the target domain. Source domain data may, however, contain sensitive information and may be restricted. In this study, we investigate a more challenging setting, source-free UDA, in which we have only the pretrained source model and target domain data, without access to source domain data. We propose a novel self-training approach to QA models that integrates a unique mask module for domain adaptation. The mask is auto-adjusted to extract key domain knowledge while trained on the source domain. To maintain previously learned domain knowledge, certain mask weights are frozen during adaptation, while other weights are adjusted to mitigate domain shifts with pseudo-labeled samples generated in the target domain. %As part of the self-training process, we generate pseudo-labeled samples in the target domain based on models trained in the source domain. Our empirical results on four benchmark datasets suggest that our approach significantly enhances the performance of pretrained QA models on the target domain, and even outperforms models that have access to the source data during adaptation.
Cross-Lingual Open-Domain Question Answering with Answer Sentence Generation
Muller, Benjamin, Soldaini, Luca, Koncel-Kedziorski, Rik, Lind, Eric, Moschitti, Alessandro
Open-Domain Generative Question Answering has achieved impressive performance in English by combining document-level retrieval with answer generation. These approaches, which we refer to as GenQA, can generate complete sentences, effectively answering both factoid and non-factoid questions. In this paper, we extend GenQA to the multilingual and cross-lingual settings. For this purpose, we first introduce GenTyDiQA, an extension of the TyDiQA dataset with well-formed and complete answers for Arabic, Bengali, English, Japanese, and Russian. Based on GenTyDiQA, we design a cross-lingual generative model that produces full-sentence answers by exploiting passages written in multiple languages, including languages different from the question. Our cross-lingual generative system outperforms answer sentence selection baselines for all 5 languages and monolingual generative pipelines for three out of five languages studied.
Task Preferences across Languages on Community Question Answering Platforms
Santy, Sebastin, Bhattacharya, Prasanta, Mehrotra, Rishabh
With the steady emergence of community question answering (CQA) platforms like Quora, StackExchange, and WikiHow, users now have an unprecedented access to information on various kind of queries and tasks. Moreover, the rapid proliferation and localization of these platforms spanning geographic and linguistic boundaries offer a unique opportunity to study the task requirements and preferences of users in different socio-linguistic groups. In this study, we implement an entity-embedding model trained on a large longitudinal dataset of multi-lingual and task-oriented question-answer pairs to uncover and quantify the (i) prevalence and distribution of various online tasks across linguistic communities, and (ii) emerging and receding trends in task popularity over time in these communities. Our results show that there exists substantial variance in task preference as well as popularity trends across linguistic communities on the platform. Findings from this study will help Q&A platforms better curate and personalize content for non-English users, while also offering valuable insights to businesses looking to target non-English speaking communities online.
Improving Question Answering Performance through Manual Annotation: Costs, Benefits and Strategies
Rybak, Piotr, Przybyła, Piotr, Ogrodniczuk, Maciej
Recently proposed systems for open-domain question answering (OpenQA) require large amounts of training data to achieve state-of-the-art performance. However, data annotation is known to be time-consuming and therefore expensive to acquire. As a result, the appropriate datasets are available only for a handful of languages (mainly English and Chinese). In this work, we introduce and publicly release PolQA, the first Polish dataset for OpenQA. It consists of 7,000 questions, 87,525 manually labeled evidence passages, and a corpus of over 7,097,322 candidate passages. Each question is classified according to its formulation, type, as well as entity type of the answer. This resource allows us to evaluate the impact of different annotation choices on the performance of the QA system and propose an efficient annotation strategy that increases the passage retrieval performance by 10.55 p.p. while reducing the annotation cost by 82%.
Towards leveraging latent knowledge and Dialogue context for real-world conversational question answering
In many real-world scenarios, the absence of external knowledge source like Wikipedia restricts question answering systems to rely on latent internal knowledge in limited dialogue data. In addition, humans often seek answers by asking several questions for more comprehensive information. As the dialog becomes more extensive, machines are challenged to refer to previous conversation rounds to answer questions. In this work, we propose to leverage latent knowledge in existing conversation logs via a neural Retrieval-Reading system, enhanced with a TFIDF-based text summarizer refining lengthy conversational history to alleviate the long context issue. Our experiments show that our Retrieval-Reading system can exploit retrieved background knowledge to generate significantly better answers. The results also indicate that our context summarizer significantly helps both the retriever and the reader by introducing more concise and less noisy contextual information.
Natural Language Processing in Customer Service: A Systematic Review
Mashaabi, Malak, Alotaibi, Areej, Qudaih, Hala, Alnashwan, Raghad, Al-Khalifa, Hend
Artificial intelligence and natural language processing (NLP) are increasingly being used in customer service to interact with users and answer their questions. The goal of this systematic review is to examine existing research on the use of NLP technology in customer service, including the research domain, applications, datasets used, and evaluation methods. The review also looks at the future direction of the field and any significant limitations. The review covers the time period from 2015 to 2022 and includes papers from five major scientific databases. Chatbots and question-answering systems were found to be used in 10 main fields, with the most common use in general, social networking, and e-commerce areas. Twitter was the second most commonly used dataset, with most research also using their own original datasets. Accuracy, precision, recall, and F1 were the most common evaluation methods. Future work aims to improve the performance and understanding of user behavior and emotions, and address limitations such as the volume, diversity, and quality of datasets. This review includes research on different spoken languages and models and techniques.
DialogQAE: N-to-N Question Answer Pair Extraction from Customer Service Chatlog
Zheng, Xin, Liu, Tianyu, Meng, Haoran, Wang, Xu, Jiang, Yufan, Rao, Mengliang, Lin, Binghuai, Sui, Zhifang, Cao, Yunbo
Harvesting question-answer (QA) pairs from customer service chatlog in the wild is an efficient way to enrich the knowledge base for customer service chatbots in the cold start or continuous integration scenarios. Prior work attempts to obtain 1-to-1 QA pairs from growing customer service chatlog, which fails to integrate the incomplete utterances from the dialog context for composite QA retrieval. In this paper, we propose N-to-N QA extraction task in which the derived questions and corresponding answers might be separated across different utterances. We introduce a suite of generative/discriminative tagging based methods with end-to-end and two-stage variants that perform well on 5 customer service datasets and for the first time setup a benchmark for N-to-N DialogQAE with utterance and session level evaluation metrics. With a deep dive into extracted QA pairs, we find that the relations between and inside the QA pairs can be indicators to analyze the dialogue structure, e.g. information seeking, clarification, barge-in and elaboration. We also show that the proposed models can adapt to different domains and languages, and reduce the labor cost of knowledge accumulation in the real-world product dialogue platform.