Question Answering
PeCoQ: A Dataset for Persian Complex Question Answering over Knowledge Graph
Etezadi, Romina, Shamsfard, Mehrnoush
Question answering systems may find the answers to users' questions from either unstructured texts or structured data such as knowledge graphs. Answering questions using supervised learning approaches including deep learning models need large training datasets. In recent years, some datasets have been presented for the task of Question answering over knowledge graphs, which is the focus of this paper. Although many datasets in English were proposed, there have been a few question-answering datasets in Persian. This paper introduces \textit{PeCoQ}, a dataset for Persian question answering. This dataset contains 10,000 complex questions and answers extracted from the Persian knowledge graph, FarsBase. For each question, the SPARQL query and two paraphrases that were written by linguists are provided as well. There are different types of complexities in the dataset, such as multi-relation, multi-entity, ordinal, and temporal constraints. In this paper, we discuss the dataset's characteristics and describe our methodology for building it.
A Neural-symbolic Approach for Ontology-mediated Query Answering
Andresel, Medina, Domokos, Csaba, Stepanova, Daria, Tran, Trung-Kien
Recently, low-dimensional vector space representations of knowledge graphs (KGs) have been applied to find answers to conjunctive queries (CQs) over incomplete KGs. However, the current methods only focus on inductive reasoning, i.e. answering CQs by predicting facts based on patterns learned from the data, and lack the ability of deductive reasoning by applying external domain knowledge. Such (expert or commonsense) domain knowledge is an invaluable resource which can be used to advance machine intelligence. To address this shortcoming, we introduce a neural-symbolic method for ontology-mediated CQ answering over incomplete KGs that operates in the embedding space. More specifically, we propose various data augmentation strategies to generate training queries using query-rewriting based methods and then exploit a novel loss function for training the model. The experimental results demonstrate the effectiveness of our training strategies and the new loss function, i.e., our method significantly outperforms the baseline in the settings that require both inductive and deductive reasoning.
Learn to Resolve Conversational Dependency: A Consistency Training Framework for Conversational Question Answering
Kim, Gangwoo, Kim, Hyunjae, Park, Jungsoo, Kang, Jaewoo
One of the main challenges in conversational question answering (CQA) is to resolve the conversational dependency, such as anaphora and ellipsis. However, existing approaches do not explicitly train QA models on how to resolve the dependency, and thus these models are limited in understanding human dialogues. In this paper, we propose a novel framework, ExCorD (Explicit guidance on how to resolve Conversational Dependency) to enhance the abilities of QA models in comprehending conversational context. ExCorD first generates self-contained questions that can be understood without the conversation history, then trains a QA model with the pairs of original and self-contained questions using a consistency-based regularizer. In our experiments, we demonstrate that ExCorD significantly improves the QA models' performance by up to 1.2 F1 on QuAC, and 5.2 F1 on CANARD, while addressing the limitations of the existing approaches.
Fine-tune the Entire RAG Architecture (including DPR retriever) for Question-Answering
Siriwardhana, Shamane, Weerasekera, Rivindu, Wen, Elliott, Nanayakkara, Suranga
In September 2020, Facebook open-sourced a new NLP model called Retrieval Augmented Generation (RAG) on the Hugging Face Transformer library. RAG is capable to use a set of support documents from an external knowledge base as a latent variable to generate the final output. The RAG model consists of an Input Encoder, a Neural Retriever, and an Output Generator. All three components are initialized with pre-trained transformers. However, the original Hugging Face implementation only allowed fine-tuning the Input Encoder and the Output Generator in an end-toend manner, while the Neural Retriever needs to be trained seperately. To the best of our knowledge, an end-to-end RAG implementation that trains all three components does not exist.
Query Embedding on Hyper-relational Knowledge Graphs
Alivanistos, Dimitrios, Berrendorf, Max, Cochez, Michael, Galkin, Mikhail
Multi-hop logical reasoning is an established problem in the field of representation learning on knowledge graphs (KGs). It subsumes both one-hop link prediction as well as other more complex types of logical queries. Existing algorithms operate only on classical, triple-based graphs, whereas modern KGs often employ a hyper-relational modeling paradigm. In this paradigm, typed edges may have several key-value pairs known as qualifiers that provide fine-grained context for facts. In queries, this context modifies the meaning of relations, and usually reduces the answer set. Hyper-relational queries are often observed in real-world KG applications, and existing approaches for approximate query answering cannot make use of qualifier pairs. In this work, we bridge this gap and extend the multi-hop reasoning problem to hyper-relational KGs allowing to tackle this new type of complex queries. Building upon recent advancements in Graph Neural Networks and query embedding techniques, we study how to embed and answer hyper-relational conjunctive queries. Besides that, we propose a method to answer such queries and demonstrate in our experiments that qualifiers improve query answering on a diverse set of query patterns.
An Intelligent Question Answering System based on Power Knowledge Graph
Tang, Yachen, Han, Haiyun, Yu, Xianmao, Zhao, Jing, Liu, Guangyi, Wei, Longfei
The intelligent question answering (IQA) system can accurately capture users' search intention by understanding the natural language questions, searching relevant content efficiently from a massive knowledge-base, and returning the answer directly to the user. Since the IQA system can save inestimable time and workforce in data search and reasoning, it has received more and more attention in data science and artificial intelligence. This article introduced a domain knowledge graph using the graph database and graph computing technologies from massive heterogeneous data in electric power. It then proposed an IQA system based on the electrical power knowledge graph to extract the intent and constraints of natural interrogation based on the natural language processing (NLP) method, to construct graph data query statements via knowledge reasoning, and to complete the accurate knowledge search and analysis to provide users with an intuitive visualization. This method thoroughly combined knowledge graph and graph computing characteristics, realized high-speed multi-hop knowledge correlation reasoning analysis in tremendous knowledge. The proposed work can also provide a basis for the context-aware intelligent question and answer.
Self-supervised Dialogue Learning for Spoken Conversational Question Answering
Chen, Nuo, You, Chenyu, Zou, Yuexian
In spoken conversational question answering (SCQA), the answer to the corresponding question is generated by retrieving and then analyzing a fixed spoken document, including multi-part conversations. Most SCQA systems have considered only retrieving information from ordered utterances. However, the sequential order of dialogue is important to build a robust spoken conversational question answering system, and the changes of utterances order may severely result in low-quality and incoherent corpora. To this end, we introduce a self-supervised learning approach, including incoherence discrimination, insertion detection, and question prediction, to explicitly capture the coreference resolution and dialogue coherence among spoken documents. Specifically, we design a joint learning framework where the auxiliary self-supervised tasks can enable the pre-trained SCQA systems towards more coherent and meaningful spoken dialogue learning. We also utilize the proposed self-supervised learning tasks to capture intra-sentence coherence. Experimental results demonstrate that our proposed method provides more coherent, meaningful, and appropriate responses, yielding superior performance gains compared to the original pre-trained language models. Our method achieves state-of-the-art results on the Spoken-CoQA dataset.
End-to-End Training of Multi-Document Reader and Retriever for Open-Domain Question Answering
Sachan, Devendra Singh, Reddy, Siva, Hamilton, William, Dyer, Chris, Yogatama, Dani
We present an end-to-end differentiable training method for retrieval-augmented open-domain question answering systems that combine information from multiple retrieved documents when generating answers. We model retrieval decisions as latent variables over sets of relevant documents. Since marginalizing over sets of retrieved documents is computationally hard, we approximate this using an expectation-maximization algorithm. We iteratively estimate the value of our latent variable (the set of relevant documents for a given question) and then use this estimate to update the retriever and reader parameters. We hypothesize that such end-to-end training allows training signals to flow to the reader and then to the retriever better than staged-wise training. This results in a retriever that is able to select more relevant documents for a question and a reader that is trained on more accurate documents to generate an answer. Experiments on three benchmark datasets demonstrate that our proposed method outperforms all existing approaches of comparable size by 2-3% absolute exact match points, achieving new state-of-the-art results. Our results also demonstrate the feasibility of learning to retrieve to improve answer generation without explicit supervision of retrieval decisions.
Question Generation for Adaptive Education
Srivastava, Megha, Goodman, Noah
Intelligent and adaptive online education systems aim to make high-quality education available for a diverse range of students. However, existing systems usually depend on a pool of hand-made questions, limiting how fine-grained and open-ended they can be in adapting to individual students. We explore targeted question generation as a controllable sequence generation task. We first show how to fine-tune pre-trained language models for deep knowledge tracing (LM-KT). This model accurately predicts the probability of a student answering a question correctly, and generalizes to questions not seen in training. We then use LM-KT to specify the objective and data for training a model to generate questions conditioned on the student and target difficulty. Our results show we succeed at generating novel, well-calibrated language translation questions for second language learners from a real online education platform.
Schema2QA: High-Quality and Low-Cost Q&A Agents for the Structured Web
Xu, Silei, Campagna, Giovanni, Li, Jian, Lam, Monica S.
Building a question-answering agent currently requires large annotated datasets, which are prohibitively expensive. This paper proposes Schema2QA, an open-source toolkit that can generate a Q&A system from a database schema augmented with a few annotations for each field. The key concept is to cover the space of possible compound queries on the database with a large number of in-domain questions synthesized with the help of a corpus of generic query templates. The synthesized data and a small paraphrase set are used to train a novel neural network based on the BERT pretrained model. We use Schema2QA to generate Q&A systems for five Schema.org domains, restaurants, people, movies, books and music, and obtain an overall accuracy between 64% and 75% on crowdsourced questions for these domains. Once annotations and paraphrases are obtained for a Schema.org schema, no additional manual effort is needed to create a Q&A agent for any website that uses the same schema. Furthermore, we demonstrate that learning can be transferred from the restaurant to the hotel domain, obtaining a 64% accuracy on crowdsourced questions with no manual effort. Schema2QA achieves an accuracy of 60% on popular restaurant questions that can be answered using Schema.org. Its performance is comparable to Google Assistant, 7% lower than Siri, and 15% higher than Alexa. It outperforms all these assistants by at least 18% on more complex, long-tail questions.