
Collaborating Authors

Musen, Mark A.


Use of a Structured Knowledge Base Enhances Metadata Curation by Large Language Models

arXiv.org Artificial Intelligence

Metadata play a crucial role in ensuring the findability, accessibility, interoperability, and reusability of datasets. This paper investigates the potential of large language models (LLMs), specifically GPT-4, to improve adherence to metadata standards. We conducted experiments on 200 randomly selected data records describing human samples relating to lung cancer from the NCBI BioSample repository, evaluating GPT-4's ability to suggest edits for adherence to metadata standards. We computed the adherence accuracy of field name-field value pairs through a peer review process, and we observed a marginal average improvement in adherence to the standard data dictionary, from 79% to 80% (p<0.5). We then prompted GPT-4 with domain information in the form of the textual descriptions of CEDAR templates and recorded a significant improvement, from 79% to 97% (p<0.01). These results indicate that, while LLMs may not be able to correct legacy metadata to ensure satisfactory adherence to standards when unaided, they do show promise for use in automated metadata curation when integrated with a structured knowledge base.
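The experimental contrast described above (unaided GPT-4 versus GPT-4 grounded in the textual description of a CEDAR template) can be pictured with a minimal sketch. The record, template text, and prompts below are invented placeholders, not the paper's actual materials or evaluation pipeline:

    # Minimal sketch of the two prompting conditions; all field values,
    # template text, and prompt wording here are hypothetical.
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    record = "tissue: Lung cancer; sex: M; age: sixty-five"        # hypothetical BioSample record
    template = ("Field 'tissue' must name an anatomical site (e.g. 'lung'); "
                "field 'sex' must be one of {male, female}; "
                "field 'age' must be an integer number of years.")  # hypothetical CEDAR template text

    def suggest_edits(record, domain_text=None):
        """Ask GPT-4 to propose standards-adherent edits, optionally grounded
        in the textual description of a metadata template."""
        system = "You correct metadata records so that field values adhere to the data dictionary."
        if domain_text:
            system += " Use this template description as the standard: " + domain_text
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "system", "content": system},
                      {"role": "user", "content": record}],
        )
        return resp.choices[0].message.content

    print(suggest_edits(record))            # unaided condition
    print(suggest_edits(record, template))  # template-informed condition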


Making Metadata More FAIR Using Large Language Models

arXiv.org Artificial Intelligence

As experimental data artifacts proliferate globally, harnessing them in a unified fashion runs into a major stumbling block: bad metadata. To address this problem, this work presents a Natural Language Processing (NLP)-informed application, called FAIRMetaText, that compares metadata. Specifically, FAIRMetaText analyzes the natural-language descriptions of metadata and provides a mathematical similarity measure between two terms. This measure can then be used to analyze varied metadata, suggesting terms for compliance or grouping similar terms to identify replaceable ones. The efficacy of the algorithm is demonstrated qualitatively and quantitatively on publicly available research artifacts, showing large gains across metadata-related tasks in an in-depth study of a wide variety of large language models (LLMs). This software can drastically reduce the human effort of sifting through natural-language metadata when working with multiple experimental datasets on the same topic.
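A minimal sketch of the core idea: embed the natural-language descriptions of two metadata terms and compare them with cosine similarity. The model name and descriptions below are stand-ins, and FAIRMetaText's actual models and scoring differ:

    # Embedding-based similarity between two metadata-term descriptions;
    # the model and example descriptions are assumptions for illustration.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in embedding model

    desc_a = "age of the donor at the time of sample collection, in years"
    desc_b = "donor_age: number of years since the donor's birth"

    emb = model.encode([desc_a, desc_b])
    score = util.cos_sim(emb[0], emb[1]).item()
    print(f"similarity: {score:.3f}")  # a high score suggests the terms are interchangeable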


An Empirical Meta-analysis of the Life Sciences (Linked?) Open Data on the Web

arXiv.org Artificial Intelligence

While the biomedical community has published several "open data" sources in the last decade, most researchers still endure severe logistical and technical challenges to discover, query, and integrate heterogeneous data and knowledge from multiple sources. To tackle these challenges, the community has experimented with Semantic Web and linked data technologies to create the Life Sciences Linked Open Data (LSLOD) cloud. In this paper, we extract schemas from more than 80 publicly available biomedical linked data graphs into an LSLOD schema graph and conduct an empirical meta-analysis to evaluate the extent of semantic heterogeneity across the LSLOD cloud. We observe that several LSLOD sources exist as stand-alone data sources that are not inter-linked with other sources, use unpublished schemas with minimal reuse or mappings, and have elements that are not useful for data integration from a biomedical perspective. We envision that the LSLOD schema graph and the findings from this research will aid researchers who wish to query and integrate data and knowledge from multiple biomedical sources simultaneously on the Web.
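The schema-extraction step can be illustrated with a small SPARQL probe that lists the classes instantiated at a single endpoint. The endpoint URL below is hypothetical, and the study's actual pipeline extracts far more (predicates, mappings, and cross-links across more than 80 graphs):

    # Sketch of a schema probe for one linked-data endpoint: which classes
    # are instantiated, and how often. Endpoint URL is a placeholder.
    from SPARQLWrapper import SPARQLWrapper, JSON

    endpoint = SPARQLWrapper("https://sparql.example.org/sparql")  # hypothetical endpoint
    endpoint.setQuery("""
        SELECT ?class (COUNT(?s) AS ?instances)
        WHERE { ?s a ?class }
        GROUP BY ?class
        ORDER BY DESC(?instances)
    """)
    endpoint.setReturnFormat(JSON)

    for row in endpoint.query().convert()["results"]["bindings"]:
        print(row["class"]["value"], row["instances"]["value"])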


WebProtégé: A Cloud-Based Ontology Editor

arXiv.org Artificial Intelligence

We present WebProtégé, a tool to develop ontologies represented in the Web Ontology Language (OWL). WebProtégé is a cloud-based application that allows users to collaboratively edit OWL ontologies, and it is available for use at https://webprotege.stanford.edu. WebProtégé currently hosts more than 68,000 OWL ontology projects and has over 50,000 user accounts. In this paper, we detail the main new features of the latest version of WebProtégé.


The Variable Quality of Metadata About Biological Samples Used in Biomedical Experiments

arXiv.org Artificial Intelligence

We present an analytical study of the quality of metadata about samples used in biomedical experiments. The metadata under analysis are stored in two well-known databases: BioSample, a repository managed by the National Center for Biotechnology Information (NCBI), and BioSamples, a repository managed by the European Bioinformatics Institute (EBI). We tested whether 11.4M sample metadata records in the two repositories are populated with values that fulfill the stated requirements. Our study revealed multiple anomalies in the metadata. Most metadata field names and their values are not standardized or controlled. Even simple binary or numeric fields are often populated with inadequate values of different data types. By clustering metadata field names, we discovered that there are often many distinct ways to represent the same aspect of a sample. Overall, the metadata we analyzed reveal a lack of principled mechanisms to enforce and validate metadata requirements. The significant aberrancies that we found in the metadata are likely to impede search and secondary use of the associated datasets.
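The field-name clustering idea can be sketched in a few lines: normalize names and group the variants that collapse to the same key. The field names and normalization rules below are invented for illustration; the study's actual clustering is more elaborate:

    # Group metadata field-name variants by a normalized key; names and the
    # toy stop list are assumptions, not the study's real data or rules.
    import re
    from collections import defaultdict

    field_names = ["Age", "age", "AGE_years", "age (years)", "donor age",
                   "Sex", "sex", "gender"]

    def normalize(name):
        name = re.sub(r"[^a-z0-9]+", " ", name.lower())  # drop case and punctuation
        stop = {"years", "donor"}                         # toy stop list
        return " ".join(t for t in name.split() if t not in stop)

    clusters = defaultdict(list)
    for name in field_names:
        clusters[normalize(name)].append(name)

    for key, members in clusters.items():
        print(f"{key!r}: {members}")
    # 'age': ['Age', 'age', 'AGE_years', 'age (years)', 'donor age'] ...

Note that pure string normalization leaves synonym pairs such as "sex" and "gender" in separate clusters, one illustration of why many distinct representations of the same sample aspect persist.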


NCBO Ontology Recommender 2.0: An Enhanced Approach for Biomedical Ontology Recommendation

arXiv.org Artificial Intelligence

Biomedical researchers use ontologies to annotate their data with ontology terms, enabling better data integration and interoperability. However, the number, variety, and complexity of current biomedical ontologies make it cumbersome for researchers to determine which ones to reuse for their specific needs. To overcome this problem, in 2010 the National Center for Biomedical Ontology (NCBO) released the Ontology Recommender, a service that receives a biomedical text corpus or a list of keywords and suggests ontologies appropriate for referencing the indicated terms. We developed a new version of the NCBO Ontology Recommender. Called Ontology Recommender 2.0, it uses a new recommendation approach that evaluates the relevance of an ontology to biomedical text data according to four criteria: (1) the extent to which the ontology covers the input data; (2) the acceptance of the ontology in the biomedical community; (3) the level of detail of the ontology classes that cover the input data; and (4) the specialization of the ontology to the domain of the input data. Our evaluation shows that the enhanced recommender provides higher-quality suggestions than the original approach: better coverage of the input data, more detailed information about their concepts, increased specialization for the domain of the input data, and greater acceptance and use in the community. In addition, it provides users with more explanatory information, along with suggestions of not only individual ontologies but also groups of ontologies, and it can be customized to fit the needs of different scenarios. Ontology Recommender 2.0 combines the strengths of its predecessor with a range of adjustments and new features that improve its reliability and usefulness. It recommends over 500 biomedical ontologies from the NCBO BioPortal platform, where it is openly available.
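One plausible way to read the four criteria is as a weighted aggregate per candidate ontology. The weights and per-criterion scores below are hypothetical placeholders, not the recommender's published scoring functions:

    # Sketch of folding the four criteria into one relevance score;
    # weights, candidate names, and scores are invented for illustration.
    CRITERIA = ("coverage", "acceptance", "detail", "specialization")
    WEIGHTS = {"coverage": 0.55, "acceptance": 0.15,
               "detail": 0.15, "specialization": 0.15}  # hypothetical weights

    def relevance(scores):
        """Weighted aggregate of normalized per-criterion scores in [0, 1]."""
        return sum(WEIGHTS[c] * scores[c] for c in CRITERIA)

    candidates = {
        "ONTO_A": {"coverage": 0.9, "acceptance": 0.4, "detail": 0.7, "specialization": 0.8},
        "ONTO_B": {"coverage": 0.6, "acceptance": 0.9, "detail": 0.5, "specialization": 0.3},
    }
    for onto, scores in sorted(candidates.items(), key=lambda kv: -relevance(kv[1])):
        print(onto, round(relevance(scores), 3))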


Discovering Beaten Paths in Collaborative Ontology-Engineering Projects using Markov Chains

arXiv.org Artificial Intelligence

Biomedical taxonomies, thesauri, and ontologies, such as the International Classification of Diseases (ICD) taxonomy and the OWL-based National Cancer Institute Thesaurus, play a critical role in acquiring, representing, and processing information about human health. With increasing adoption and relevance, biomedical ontologies have also significantly increased in size. For example, the 11th revision of the ICD, which is currently under active development by the WHO, contains nearly 50,000 classes representing a vast variety of diseases and causes of death. This growth in size was accompanied by an evolution in the way ontologies are engineered. Because no single individual has the expertise to develop such large-scale ontologies, ontology-engineering projects have evolved from small-scale efforts involving just a few domain experts to large-scale projects that require effective collaboration between dozens or even hundreds of experts, practitioners, and other stakeholders. Understanding how these stakeholders collaborate will enable us to improve the editing environments that support such collaborations. We uncover how large ontology-engineering projects, such as the ICD in its 11th revision, unfold by using Markov chains to analyze the usage logs of five biomedical ontology-engineering projects of varying sizes and scopes. We discover intriguing interaction patterns (e.g., which properties users subsequently change) that suggest that large collaborative ontology-engineering projects are governed by a few general principles that determine and drive development. From our analysis, we identify commonalities and differences between projects that have implications for project managers, ontology editors, developers, and contributors working on collaborative ontology-engineering projects and tools in the biomedical domain.
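The analytical core, a first-order Markov chain over an edit log, can be sketched as counting consecutive action pairs and row-normalizing the counts into transition probabilities. The action names below are invented:

    # Estimate first-order Markov transition probabilities from a toy edit log.
    from collections import Counter, defaultdict

    log = ["edit_title", "edit_definition", "edit_definition",
           "add_synonym", "edit_title", "edit_definition"]  # invented usage log

    pair_counts = Counter(zip(log, log[1:]))   # consecutive action pairs
    totals = defaultdict(int)
    for (src, _), n in pair_counts.items():
        totals[src] += n

    # P(dst | src) = count(src -> dst) / count(src -> *)
    P = {(src, dst): n / totals[src] for (src, dst), n in pair_counts.items()}
    for (src, dst), p in sorted(P.items()):
        print(f"P({dst} | {src}) = {p:.2f}")

Patterns in such transition matrices (e.g., which property users change next) are what the analysis compares across the five projects.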


Ontology Quality Assurance with the Crowd

AAAI Conferences

The Semantic Web has the potential to change the Web as we know it. However, the community faces a significant challenge in managing, aggregating, and curating the massive amount of data and knowledge involved. Human computation is only beginning to serve an essential role in the curation of these Web-based data. Ontologies, which facilitate data integration and search, serve as a central component of the Semantic Web, but they are large and complex and typically require extensive expert curation. Furthermore, ontology-engineering tasks require more knowledge than a typical crowdsourcing task does. We have developed ontology-engineering methods that leverage the crowd. In this work, we describe our general crowdsourcing workflow. We then highlight our work on applying this workflow to ontology verification and quality assurance. In a pilot study, this method approaches expert ability, finding the same errors that experts identified with 86% accuracy, in a faster and more scalable fashion. The work provides a general framework with which to develop crowdsourcing methods for the Semantic Web. In addition, it highlights opportunities for future research in human computation and crowdsourcing.
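The aggregation step of such a workflow can be sketched as majority voting over worker judgments on microtasks, one per ontology assertion. The assertions and ballots below are invented examples, not the pilot study's data:

    # Majority-vote aggregation over crowdsourced verification microtasks;
    # assertions and worker ballots are hypothetical.
    from collections import Counter

    votes = {
        "is-a(appendicitis, inflammatory disease)": ["yes", "yes", "yes", "no", "yes"],
        "is-a(femur, organ)":                       ["no", "no", "yes", "no", "no"],
    }

    for assertion, ballots in votes.items():
        verdict, count = Counter(ballots).most_common(1)[0]
        agreement = count / len(ballots)
        print(f"{assertion}: {verdict} (agreement {agreement:.0%})")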


Graph-Grammar Assistance for Automated Generation of Influence Diagrams

arXiv.org Artificial Intelligence

One of the most difficult aspects of modeling complex dilemmas in decision-analytic terms is composing a diagram of relevance relations from a set of domain concepts. Decision models in domains such as medicine, however, exhibit certain prototypical patterns that can guide the modeling process. Medical concepts can be classified according to semantic types that have characteristic positions and typical roles in an influence-diagram model. We have developed a graph-grammar production system that uses such inherent interrelationships among medical terms to facilitate the modeling of medical decisions.
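A graph-grammar production of this kind can be sketched as a rule table keyed on the semantic types of concept pairs, with each matching rule proposing an arc in the influence diagram. The types, rules, and concepts below are illustrative only, not the system's actual grammar:

    # Toy production system: semantic-type pairs trigger rules that add arcs.
    RULES = {
        ("disease", "finding"):   "disease node influences finding node",
        ("treatment", "disease"): "treatment decision influences disease node",
    }  # invented productions

    concepts = [("pneumonia", "disease"), ("fever", "finding"),
                ("antibiotics", "treatment")]  # invented domain concepts

    # Apply each production to every ordered pair whose semantic types match.
    for a, type_a in concepts:
        for b, type_b in concepts:
            rule = RULES.get((type_a, type_b))
            if rule:
                print(f"{a} -> {b}  ({rule})")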


Pragmatic Analysis of Crowd-Based Knowledge Production Systems with iCAT Analytics: Visualizing Changes to the ICD-11 Ontology

AAAI Conferences

While taxonomic and ontological knowledge was traditionally produced by small groups of co-located experts, today the production of such knowledge has a radically different shape and form. For example, potentially thousands of health professionals, scientists, and ontology experts will collaboratively construct, evaluate, and maintain the most recent version of the International Classification of Diseases (ICD-11), a large ontology of diseases and causes of death managed by the World Health Organization. In this work, we present a novel web-based tool, iCAT Analytics, that supports systematic investigation of crowd-based processes in knowledge-production systems. To enable such investigation, the tool supports interactive exploration of pragmatic aspects of ontology engineering, such as how a given ontology evolved and the nature of the changes, discussions, and interactions that took place during its production. While iCAT Analytics was motivated by ICD-11, it could potentially be applied to any crowd-based ontology-engineering project. We give an introduction to the features of iCAT Analytics and present some insights specifically for ICD-11.