 cord-19 dataset


Exploring Topic Trends in COVID-19 Research Literature using Non-Negative Matrix Factorization

Patel, Divya, Parikh, Vansh, Patel, Om, Shah, Agam, Chaudhury, Bhaskar

arXiv.org Artificial Intelligence

In this work, we apply topic modeling using Non-Negative Matrix Factorization (NMF) on the COVID-19 Open Research Dataset (CORD-19) to uncover the underlying thematic structure and its evolution within the extensive body of COVID-19 research literature. NMF factorizes the document-term matrix into two non-negative matrices, effectively representing the topics and their distribution across the documents. This helps us see how strongly documents relate to topics and how topics relate to words. We describe the complete methodology which involves a series of rigorous pre-processing steps to standardize the available text data while preserving the context of phrases, and subsequently feature extraction using the term frequency-inverse document frequency (tf-idf), which assigns weights to words based on their frequency and rarity in the dataset. To ensure the robustness of our topic model, we conduct a stability analysis. This process assesses the stability scores of the NMF topic model for different numbers of topics, enabling us to select the optimal number of topics for our analysis. Through our analysis, we track the evolution of topics over time within the CORD-19 dataset. Our findings contribute to the understanding of the knowledge structure of the COVID-19 research landscape, providing a valuable resource for future research in this field.
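As a rough illustration of the pipeline this abstract describes (tf-idf features followed by NMF), here is a minimal Python sketch. It is not the authors' implementation: the toy documents, the whitespace tokenizer, the smoothed idf formula, and the multiplicative-update solver are all assumptions chosen to keep the example free of external dependencies. W holds the document-topic weights and H the topic-term weights.

```python
import math
import random
from collections import Counter


def tfidf_matrix(docs):
    """Return (vocab, X), where X[d][t] is the tf-idf weight of term t in doc d."""
    tokenized = [doc.lower().split() for doc in docs]  # naive tokenizer (assumption)
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed idf (assumption): rarer terms get larger weights.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    X = []
    for toks in tokenized:
        counts = Counter(toks)
        X.append([counts[t] * idf[t] for t in vocab])
    return vocab, X


def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def nmf(X, k, n_iter=200, seed=0, eps=1e-9):
    """Approximate X (docs x terms) by W (docs x k) times H (k x terms).

    Uses multiplicative updates for the Frobenius loss, which keep
    every entry of W and H non-negative.
    """
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(n_iter):
        # H <- H * (W^T X) / (W^T W H)
        Wt = transpose(W)
        num = matmul(Wt, X)
        den = matmul(matmul(Wt, W), H)
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        # W <- W * (X H^T) / (W H H^T)
        Ht = transpose(H)
        num = matmul(X, Ht)
        den = matmul(W, matmul(H, Ht))
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return W, H


def frobenius_error(X, W, H):
    WH = matmul(W, H)
    return math.sqrt(sum((x - y) ** 2
                         for rx, ry in zip(X, WH) for x, y in zip(rx, ry)))


# Toy corpus (hypothetical, for illustration only).
docs = [
    "vaccine trial efficacy vaccine dose",
    "vaccine dose immune response",
    "lockdown mobility transmission model",
    "transmission model lockdown policy",
]
vocab, X = tfidf_matrix(docs)
W, H = nmf(X, k=2)

# Highest-weighted terms per topic, read off the rows of H.
top_terms = [
    [vocab[j] for j in sorted(range(len(vocab)), key=lambda j: -H[t][j])[:3]]
    for t in range(2)
]
```

Each row of W shows how strongly one document relates to each topic, and each row of H shows how strongly one topic relates to each word. A stability analysis like the one the abstract mentions would repeat this factorization for several values of k and compare the resulting topics across runs.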


Constructing the CORD-19 Vaccine Dataset

Singh, Manisha, Sharma, Divy, Ma, Alonso, Tyree, Bridget, Mitchell, Margaret

arXiv.org Artificial Intelligence

We introduce a new dataset, 'CORD-19-Vaccination', to cater to scientists specifically looking into COVID-19 vaccine-related research. This dataset is extracted from the CORD-19 dataset [Wang et al., 2020] and augmented with new columns for language detail, author demography, keywords, and topic per paper. Facebook's fastText model is used to identify languages [Joulin et al., 2016]. To establish author demography (author affiliation, lab/institution location, and lab/institution country columns), we processed the JSON file for each paper and then further enhanced the result using Google's search API to determine country values. 'Yake' was used to extract keywords from the title, abstract, and body of each paper, and the LDA (Latent Dirichlet Allocation) algorithm was used to add topic information [Campos et al., 2020, 2018a,b]. To evaluate the dataset, we demonstrate a question-answering task like the one used in the CORD-19 Kaggle challenge [Goldbloom et al., 2022]. For further evaluation, sequential sentence classification was performed on each paper's abstract using the model from Dernoncourt et al. [2016]. We partially hand-annotated the training dataset and used a pre-trained BERT-PubMed layer. 'CORD-19-Vaccination' contains 30k research papers and can be immensely valuable for NLP research such as text mining, information extraction, and question answering, specific to the domain of COVID-19 vaccine research.


COV19IR : COVID-19 Domain Literature Information Retrieval

Bose, Arusarka, Zhou, Zili, Xu, Guandong

arXiv.org Artificial Intelligence

The increasing volume of COVID-19 research literature poses new challenges for effective literature screening and for Information Retrieval that is aware of COVID-19 domain knowledge. To tackle these challenges, we demonstrate two tasks along with solutions: COVID-19 literature retrieval and question answering. The COVID-19 literature retrieval task screens COVID-19 literature documents that match a textual user query, and the COVID-19 question answering task predicts the text fragments in a corpus that answer specific COVID-19-related questions. Based on transformer neural networks, we provide solutions that implement the tasks on the CORD-19 dataset, and we display some examples to show the effectiveness of our proposed solutions.


Testing the Generalization of Neural Language Models for COVID-19 Misinformation Detection

Wahle, Jan Philip, Ashok, Nischal, Ruas, Terry, Meuschke, Norman, Ghosal, Tirthankar, Gipp, Bela

arXiv.org Artificial Intelligence

A drastic rise in potentially life-threatening misinformation has been a by-product of the COVID-19 pandemic. Computational support to identify false information within the massive body of data on the topic is crucial to prevent harm. Researchers have proposed many methods for flagging online misinformation related to COVID-19. However, these methods predominantly target specific content types (e.g., news) or platforms (e.g., Twitter). How well the methods generalize has so far been largely unclear. To fill this gap, we evaluate fifteen Transformer-based models on five COVID-19 misinformation datasets that include social media posts, news articles, and scientific papers. We show that tokenizers and models tailored to COVID-19 data do not provide a significant advantage over general-purpose ones. Our study provides a realistic assessment of models for detecting COVID-19 misinformation. We expect that evaluating a broad spectrum of datasets and models will benefit future research in developing misinformation detection systems.


Gaining a sense of control over the COVID-19 pandemic

#artificialintelligence

How one Kaggler took top marks across multiple COVID-related challenges. Today we interview Daniel, whose notebooks earned him top marks in Kaggle's CORD-19 challenges. Kaggle hosted multiple challenges that worked with the Kaggle CORD-19 dataset, and Daniel won 1st place three times, including by a huge margin in the TREC-COVID challenge. My research interests include probabilistic forecasting, causal inference and machine learning. As part of the Kaggle CORD-19 challenge I developed discovid.ai. I'm also a student assistant; in that role I've worked on several data science projects over the last 3 years and had the opportunity to work with real-world data from different companies in highly diverse domains -- from predicting the waste in a sawmill to analyzing flaws in the process of surface galvanization and testing the efficiency of a marketing campaign.


When his hobbies went on hiatus, this Kaggler made fighting COVID-19 with data his mission

#artificialintelligence

Most medical articles have methods & results sections, and matches in those sections are more important. I had little to no expectations entering this competition, so I wouldn't say I was surprised by anything. It was great to see so many smart and capable people all working together to try to help in whatever way they could. All of the work is driven by the Kaggle platform. The list of notebooks covers all the submissions for Round 1 and Round 2 of the CORD-19 challenge. All of the notebooks are in Python.


How Elsevier Accelerated COVID-19 research using Dask on Saturn Cloud -- Elsevier Labs

#artificialintelligence

The version of CORD-19 that we used yielded 3,389,064 paragraphs and 16,952,279 sentences. Each sentence is sent to each model and yields zero or more entities. A notable point is that the process of generating entities from sentences is embarrassingly parallel, so processing multiple sentences in parallel reduces processing time. To process the dataset, we used Dask, an open source library for parallel computing in Python. Dask provides multiple convenient abstractions that mimic familiar APIs such as NumPy arrays and pandas DataFrames, and that can operate on datasets that do not fit in main memory.
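To make the "embarrassingly parallel" point concrete, here is a small Python sketch in which every sentence is processed independently, so the work can be mapped across workers with no coordination. It deliberately swaps in the standard library's `concurrent.futures.ThreadPoolExecutor` for Dask so that it runs without extra dependencies, and the keyword-matching "models", labels, and sample sentences are hypothetical stand-ins rather than anything used by Elsevier Labs.

```python
from concurrent.futures import ThreadPoolExecutor


def keyword_model(label, keywords):
    """Build a toy entity model (hypothetical): tags tokens found in `keywords`."""
    def model(sentence):
        tokens = [tok.strip(".,;:()").lower() for tok in sentence.split()]
        return [(label, tok) for tok in tokens if tok in keywords]
    return model


# Stand-ins for the entity-recognition models; each returns zero or more entities.
MODELS = [
    keyword_model("VIRUS", {"sars-cov-2", "mers-cov"}),
    keyword_model("DRUG", {"remdesivir", "dexamethasone"}),
]


def extract_entities(sentence):
    """Send one sentence to every model and collect the entities it yields."""
    entities = []
    for model in MODELS:
        entities.extend(model(sentence))
    return entities


def extract_all(sentences, max_workers=4):
    """Process sentences independently; pool.map preserves input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(extract_entities, sentences))


sentences = [
    "Remdesivir was evaluated against SARS-CoV-2.",
    "The cohort was followed for six months.",
    "Patients received dexamethasone.",
]
results = extract_all(sentences)
```

Because no sentence depends on another, the same `extract_entities` function could in principle be mapped over partitions of a much larger collection by a distributed scheduler; threads are used here only to keep the sketch short.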


Bringing IBM NLP capabilities to the CORD-19 Dataset

#artificialintelligence

To assist in the fight against the COVID-19 pandemic, prominent research institutes led by the Allen Institute for AI (AI2) released earlier this year the COVID-19 Open Research Dataset (CORD-19). Composed of scientific articles related to COVID-19, SARS-CoV-2, and related coronaviruses, the dataset (which at the time of writing contains more than 75,000 full-text scientific papers) is intended to mobilize researchers to apply recent advances in natural language processing to generate new insights in support of the fight against this infectious disease (1,2). While a tremendous resource, the dataset initially did not include information found in tables due to the difficulty of extracting tabular data. However, following the launch of the Kaggle challenge associated with CORD-19, table information rose to become the most requested feature by challenge participants. Recognizing that critical scientific facts and data are often organized in a tabular format, IBM Research AI offered to apply our extensive experience in document and table conversion to update the CORD-19 dataset and, in turn, open up additional critical information to the global science and medical community in efforts to fight COVID-19.


Coronavirus Knowledge Graph: A Case Study

Chen, Chongyan, Ebeid, Islam Akef, Bu, Yi, Ding, Ying

arXiv.org Artificial Intelligence

The emergence of the novel COVID-19 pandemic has had a significant impact on global healthcare and the economy over the past few months. The virus's rapid spread has led to a proliferation of biomedical research addressing the pandemic and its related topics. Knowledge Graphs are among the essential Knowledge Discovery tools that could help the biomedical research community understand and eventually find a cure for COVID-19. The CORD-19 dataset is a collection of publicly available full-text research articles that have been recently published on COVID-19 and coronavirus topics. Here, we use several Machine Learning, Deep Learning, and Knowledge Graph construction and mining techniques to formalize and extract insights from the PubMed dataset and the CORD-19 dataset to identify COVID-19-related experts and bio-entities. In addition, we suggest possible techniques to predict related diseases, drug candidates, genes, gene mutations, and related compounds as part of a systematic effort to apply Knowledge Discovery methods to help biomedical researchers tackle the pandemic.


COVID-19 Datasets Bring AI Experts, Life Sciences Researchers Together For A Cure - AI Trends

#artificialintelligence

All of the Bio-IT community is eager to contribute to plans for treatments, diagnostics and vaccines for SARS-CoV-2 and the resulting disease, COVID-19. Companies are donating consulting services, compute resources, tools for clinical trials, and so much more. But the biggest donations might be the sheer volume of data being pooled for researchers to mine for answers. On March 16, the Allen Institute for AI (AI2), Chan Zuckerberg Initiative (CZI), Georgetown University's Center for Security and Emerging Technology (CSET), Microsoft, and the National Library of Medicine (NLM) released the COVID-19 Open Research Dataset (CORD-19). The dataset, accessible through the Allen Institute for AI's Semantic Scholar platform, includes scholarly literature about COVID-19, SARS-CoV-2, and the coronavirus group.