Leveraging External Knowledge Resources to Enable Domain-Specific Comprehension

Saptarshi Sengupta, Connor Heaton, Prasenjit Mitra, Soumalya Sarkar

arXiv.org Artificial Intelligence 

Machine Reading Comprehension (MRC) has been a long-standing problem in NLP and, with the recent introduction of the BERT family of transformer-based language models, it has come a long way toward being solved. However, when BERT variants trained on general text corpora are applied to domain-specific text, their performance inevitably degrades because of domain shift, i.e., the genre/subject-matter discrepancy between the training and downstream application data. Knowledge graphs act as reservoirs of either open- or closed-domain information, and prior studies have shown that they can be used to improve the performance of general-purpose transformers in domain-specific applications. Building on existing work, we introduce a method that uses Multi-Layer Perceptrons (MLPs) to align and integrate embeddings extracted from knowledge graphs with the embedding spaces of pre-trained language models (LMs). We fuse the aligned embeddings with the open-domain LMs BERT and RoBERTa, and fine-tune them for two MRC tasks, namely span detection (COVID-QA) and multiple-choice question answering (PubMedQA). On the COVID-QA dataset, our approach allows these models to perform comparably to their domain-specific counterparts, Bio/Sci-BERT, as evidenced by the Exact Match (EM) metric. On PubMedQA, we observe an overall improvement in accuracy while F1 remains roughly on par with the domain-specific models.

MRC is defined as a class of supervised question answering (QA) problems wherein a system learns a function to answer a question given an associated passage(s): given a question and a context text, select the answer to the question from within the context. Mathematically, MRC: f(C, Q) → A, where C is the relevant context, Q is the question, and A is the answer space to be learned (Liu et al., 2019). Reading comprehension is one of the most challenging areas of NLP, since a system needs to handle multiple facets of language (identifying entities, supporting facts in the context, the intent of the question, etc.) to answer correctly. Fortunately, with the introduction of the Transformer (Vaswani et al., 2017) and the subsequent BERT (Devlin et al., 2019) family of models (Rogers et al., 2020), the state of the art in MRC has moved forward by leaps and bounds.
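To make the alignment step described in the abstract concrete, the following is a minimal PyTorch sketch of one way an MLP-based alignment and fusion could be implemented. The module names, the dimensions (a 200-dimensional KG embedding and the 768-dimensional hidden size of BERT-base/RoBERTa-base), and the gated additive fusion are illustrative assumptions for this sketch, not the exact architecture used in the paper.

    import torch
    import torch.nn as nn

    class KGAlignmentMLP(nn.Module):
        # Projects knowledge-graph entity embeddings into the LM embedding space.
        # kg_dim and hidden_dim are illustrative; lm_dim=768 matches BERT-base/RoBERTa-base.
        def __init__(self, kg_dim=200, lm_dim=768, hidden_dim=512):
            super().__init__()
            self.align = nn.Sequential(
                nn.Linear(kg_dim, hidden_dim),
                nn.GELU(),
                nn.Linear(hidden_dim, lm_dim),
            )

        def forward(self, kg_emb):
            # kg_emb: (batch, seq_len, kg_dim) -> (batch, seq_len, lm_dim)
            return self.align(kg_emb)

    class GatedFusion(nn.Module):
        # Fuses aligned KG embeddings with the LM's token embeddings via a learned,
        # per-token sigmoid gate (one plausible fusion choice among several).
        def __init__(self, lm_dim=768):
            super().__init__()
            self.gate = nn.Linear(2 * lm_dim, 1)

        def forward(self, token_emb, aligned_kg_emb):
            # Both inputs: (batch, seq_len, lm_dim); KG vectors are assumed to be
            # looked up per token (e.g., via entity linking) before this step.
            g = torch.sigmoid(self.gate(torch.cat([token_emb, aligned_kg_emb], dim=-1)))
            return token_emb + g * aligned_kg_emb

    # Example: align 200-d KG vectors for a batch of 2 sequences of 16 tokens
    # and fuse them with BERT-sized token embeddings.
    aligner, fusion = KGAlignmentMLP(), GatedFusion()
    kg_vecs = torch.randn(2, 16, 200)
    tok_emb = torch.randn(2, 16, 768)
    fused = fusion(tok_emb, aligner(kg_vecs))  # shape (2, 16, 768)

In such a setup, the fused representations would then be passed through the standard fine-tuning head, e.g., start/end logits for span detection on COVID-QA or a classification layer for multiple-choice answering on PubMedQA.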