cord-19
LLM-based feature generation from text for interpretable machine learning
Balek, Vojtěch, Sýkora, Lukáš, Sklenák, Vilém, Kliegr, Tomáš
Existing text representations such as embeddings and bag-of-words are not suitable for rule learning due to their high dimensionality and absent or questionable feature-level interpretability. This article explores whether large language models (LLMs) could address this by extracting a small number of interpretable features from text. We demonstrate this process on two datasets (CORD-19 and M17+) containing several thousand scientific articles from multiple disciplines and a target being a proxy for research impact. An evaluation based on testing for statistically significant correlation with research impact has shown that Llama 2-generated features are semantically meaningful. We consequently used these generated features in text classification to predict the binary target variable representing the citation rate for the CORD-19 dataset and the ordinal 5-class target representing an expert-awarded grade in the M17+ dataset. Machine-learning models trained on the LLM-generated features provided similar predictive performance to the state-of-the-art embedding model SciBERT for scientific text. The LLM used only 62 features compared to 768 features in SciBERT embeddings, and these features were directly interpretable, corresponding to notions such as an article's methodological rigor, novelty, or grammatical correctness. As the final step, we extract a small number of well-interpretable action rules. Consistently competitive results obtained with the same LLM feature set across both thematically diverse datasets show that this approach generalizes across domains.
An Information Retrieval and Extraction Tool for Covid-19 Related Papers
Background: The COVID-19 pandemic has caused severe impacts on health systems worldwide. Its critical nature and the increased interest of individuals and organizations in developing countermeasures to the problem have led to a surge of new studies in scientific journals. Objective: We sought to develop a tool that incorporates, in a novel way, aspects of Information Retrieval (IR) and Information Extraction (IE) applied to the COVID-19 Open Research Dataset (CORD-19). The main focus of this paper is to provide researchers with a better search tool for COVID-19 related papers, helping them find reference papers and highlight relevant entities in text. Method: We applied Latent Dirichlet Allocation (LDA) to model, based on research aspects, the topics of all English abstracts in CORD-19. Relevant named entities of each abstract were extracted and linked to the corresponding UMLS concept. Regular expressions and the K-Nearest Neighbors algorithm were used to rank relevant papers. Results: Our tool has shown the potential to assist researchers by automating a topic-based search of CORD-19 papers. Nonetheless, we identified that more fine-tuned topic modeling parameters and increased accuracy of the research aspect classifier model could lead to a more accurate and reliable tool. Conclusion: We emphasize the need for new automated tools to help researchers find relevant COVID-19 documents, in addition to automatically extracting useful information contained in them. Our work suggests that combining different algorithms and models could lead to new ways of browsing COVID-19 paper data.
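The ranking step mentioned in the Method can be illustrated with a minimal sketch. It assumes each abstract is already represented by a topic distribution (such as LDA would produce) and that nearest neighbors are found by cosine similarity; the abstract does not state the distance measure or the data layout, so the paper identifiers, the vectors, and the k_nearest function below are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length topic vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def k_nearest(query, papers, k=2):
    """Return the ids of the k papers whose topic vectors are closest to the query."""
    ranked = sorted(
        papers.items(),
        key=lambda item: cosine(query, item[1]),
        reverse=True,
    )
    return [paper_id for paper_id, _ in ranked[:k]]


# Hypothetical topic distributions over three topics (illustrative values only).
papers = {
    "paper_a": [0.80, 0.10, 0.10],
    "paper_b": [0.10, 0.80, 0.10],
    "paper_c": [0.70, 0.20, 0.10],
}
query = [0.90, 0.05, 0.05]  # assumed topic profile of the user's search

top = k_nearest(query, papers, k=2)
print(top)
```

With these illustrative vectors, the two papers dominated by the first topic rank ahead of the one dominated by the second topic.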
Bringing IBM NLP capabilities to the CORD-19 Dataset
To assist in the fight against the COVID-19 pandemic, prominent research institutes led by the Allen Institute for AI (AI2) released earlier this year the COVID-19 Open Research Dataset (CORD-19). Comprising scientific articles related to COVID-19, SARS-CoV-2, and related coronaviruses, the dataset (which at the time of writing contains more than 75,000 full-text scientific papers) is intended to mobilize researchers to apply recent advances in natural language processing to generate new insights in support of the fight against this infectious disease (1,2). While a tremendous resource, the dataset initially did not include information found in tables due to the difficulty of extracting tabular data. However, following the launch of the Kaggle challenge associated with CORD-19, table information rose to become the most requested feature by challenge participants. Recognizing that critical scientific facts and data are often organized in a tabular format, IBM Research AI offered to apply our extensive experience in document and table conversion to update the CORD-19 dataset and, in turn, open up additional critical information to the global science and medical community in efforts to fight COVID-19.
Answering Questions on COVID-19 in Real-Time
Lee, Jinhyuk, Yi, Sean S., Jeong, Minbyul, Sung, Mujeen, Yoon, Wonjin, Choi, Yonghwa, Ko, Miyoung, Kang, Jaewoo
The recent outbreak of the novel coronavirus is wreaking havoc on the world and researchers are struggling to effectively combat it. One reason the fight is difficult is the lack of information and knowledge. In this work, we outline our effort to contribute to shrinking this knowledge vacuum by creating covidAsk, a question answering (QA) system that combines biomedical text mining and QA techniques to provide answers to questions in real-time. Our system leverages both supervised and unsupervised approaches to provide informative answers using DenSPI (Seo et al., 2019) and BEST (Lee et al., 2016). Evaluation of covidAsk is carried out using a manually created dataset called COVID-19 Questions, which is based on facts about COVID-19. We hope our system will be able to aid researchers in their search for knowledge and information not only for COVID-19 but for future pandemics as well.
A Qualitative Evaluation of Language Models on Automatic Question-Answering for COVID-19
COVID-19 has resulted in an ongoing pandemic and as of 12 June 2020, has caused more than 7.4 million cases and over 418,000 deaths. The highly dynamic and rapidly evolving situation with COVID-19 has made it difficult to access accurate, on-demand information regarding the disease. Online communities, forums, and social media provide potential venues to search for relevant questions and answers, or post questions and seek answers from other members. However, due to the nature of such sites, there are always a limited number of relevant questions and responses to search from, and posted questions are rarely answered immediately. With the advancements in the field of natural language processing, particularly in the domain of language models, it has become possible to design chatbots that can automatically answer consumer questions. However, such models are rarely applied and evaluated in the healthcare domain, to meet the information needs with accurate and up-to-date healthcare data. In this paper, we propose to apply a language model for automatically answering questions related to COVID-19 and qualitatively evaluate the generated responses. We utilized the GPT-2 language model and applied transfer learning to retrain it on the COVID-19 Open Research Dataset (CORD-19) corpus. In order to improve the quality of the generated responses, we applied 4 different approaches, namely tf-idf, BERT, BioBERT, and USE to filter and retain relevant sentences in the responses. In the performance evaluation step, we asked two medical experts to rate the responses. We found that BERT and BioBERT, on average, outperform both tf-idf and USE in relevance-based sentence filtering tasks. Additionally, based on the chatbot, we created a user-friendly interactive web application to be hosted online.
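As a rough illustration of the tf-idf variant of relevance-based sentence filtering, the sketch below scores each generated sentence against the question by tf-idf cosine similarity and keeps those above a cutoff. The abstract does not describe how the filtering was implemented, so the tokenizer, the smoothed idf formula, the 0.1 threshold, and all function names are assumptions made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens (an assumed, simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Build sparse tf-idf vectors (dicts) for a list of texts."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed idf, so terms that occur in every text keep a nonzero weight.
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    return [
        {t: tf * idf[t] for t, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def filter_sentences(question, sentences, threshold=0.1):
    """Keep sentences whose similarity to the question reaches the (assumed) threshold."""
    vectors = tfidf_vectors([question] + sentences)
    q_vec, s_vecs = vectors[0], vectors[1:]
    return [s for s, v in zip(sentences, s_vecs) if cosine(q_vec, v) >= threshold]


# Illustrative inputs, not taken from the paper.
question = "What are the symptoms of COVID-19?"
sentences = [
    "Common symptoms of COVID-19 include fever and cough.",
    "Stock markets fell sharply on Tuesday.",
]
kept = filter_sentences(question, sentences)
print(kept)
```

The BERT, BioBERT, and USE variants named in the abstract would replace the tf-idf vectors with sentence embeddings while keeping the same score-and-filter structure.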
New tools aim to tame pandemic paper tsunami
Science's COVID-19 coverage is supported by the Pulitzer Center. Timothy Sheahan, a virologist studying COVID-19, wishes he could keep pace with the growing torrent of new scientific papers related to the pandemic. But there have just been too many--more than 5000 papers a week. "I'm not keeping up," says Sheahan, who works at the University of North Carolina, Chapel Hill. A loose-knit army of data scientists and software developers is pressing hard to change that.
Rapidly Bootstrapping a Question Answering Dataset for COVID-19
Tang, Raphael, Nogueira, Rodrigo, Zhang, Edwin, Gupta, Nikhil, Cam, Phuong, Cho, Kyunghyun, Lin, Jimmy
We present CovidQA, the beginnings of a question answering dataset specifically designed for COVID-19, built by hand from knowledge gathered from Kaggle's COVID-19 Open Research Dataset Challenge. To our knowledge, this is the first publicly available resource of its type, and intended as a stopgap measure for guiding research until more substantial evaluation resources become available. While this dataset, comprising 124 question-article pairs as of the present version 0.1 release, does not have sufficient examples for supervised machine learning, we believe that it can be helpful for evaluating the zero-shot or transfer capabilities of existing models on topics specifically related to COVID-19. This paper describes our methodology for constructing the dataset and presents the effectiveness of a number of baselines, including term-based techniques and various transformer-based models. The dataset is available at http://covidqa.ai/
Fighting the Covid-19: All the datasets and data efforts in one place
Since the coronavirus erupted into our world, research institutes and governments have publicly released many databases to allow research groups (and independent individuals) to analyze data on the virus's spread. These databases are scattered across numerous initiatives and sources. The purpose of this blog is to organize all the major open databases and data initiatives around the world. In response to the COVID-19 pandemic, the White House and a coalition of leading research groups have prepared the COVID-19 Open Research Dataset (CORD-19).
CORD-19: Database of scientific articles launched to help AI fight COVID-19
Earlier this week, five organisations released an open dataset – CORD-19 – containing nearly 30 000 scientific articles with the hopes that artificial intelligence will be able to use the data and combat the spread of COVID-19 infections. These articles have previously been published in journals, or were saved on pre-print servers. CORD-19 is short for COVID-19 Open Research Data set. The CORD-19 dataset was released after the Trump administration issued a "call to action" for the tech community to develop AI (artificial intelligence) techniques to curb the spread of COVID-19 infections. In addition, Michael Kratsios, US Chief Technology Officer at The White House, explained that "decisive action from America's science and technology enterprise" was needed to prevent, detect, treat and develop a cure for COVID-19.