He, Yongqun
Cancer Vaccine Adjuvant Name Recognition from Biomedical Literature using Large Language Models
Rehana, Hasin, Zheng, Jie, Yeh, Leo, Bansal, Benu, Çam, Nur Bengisu, Jemiyo, Christianah, McGregor, Brett, Özgür, Arzucan, He, Yongqun, Hur, Junguk
Motivation: An adjuvant is a chemical incorporated into vaccines that enhances their efficacy by improving the immune response. Identifying adjuvant names from cancer vaccine studies is essential for furthering research and enhancing immunotherapies. However, the manual curation from the constantly expanding biomedical literature poses significant challenges. This study explores the automated recognition of vaccine adjuvant names using Large Language Models (LLMs), specifically Generative Pretrained Transformers (GPT) and Large Language Model Meta AI (Llama). Methods: We utilized two datasets: 97 clinical trial records from AdjuvareDB and 290 abstracts annotated with the Vaccine Adjuvant Compendium (VAC). GPT-4o and Llama 3.2 were employed in zero-shot and few-shot learning paradigms with up to four examples per prompt. Prompts explicitly targeted adjuvant names, testing the impact of contextual information such as substances or interventions. Outputs underwent automated and manual validation for accuracy and consistency. Results: GPT-4o attained 100% Precision across all situations while exhibiting notable improve in Recall and F1-scores, particularly with incorporating interventions. On the VAC dataset, GPT-4o achieved a maximum F1-score of 77.32% with interventions, surpassing Llama-3.2-3B by approximately 2%. On the AdjuvareDB dataset, GPT-4o reached an F1-score of 81.67% for three-shot prompting with interventions, surpassing Llama-3.2-3 B's maximum F1-score of 65.62%. Conclusion: Our findings demonstrate that LLMs excel at identifying adjuvant names, including rare variations of naming representation. This study emphasizes the capability of LLMs to enhance cancer vaccine development by efficiently extracting insights. Future work aims to broaden the framework to encompass various biomedical literature and enhance model generalizability across various vaccines and adjuvants.
A Fourfold Pathogen Reference Ontology Suite
Babcock, Shane, Benson, Carter, De Colle, Giacomo, Cohen, Sydney, Diehl, Alexander D., Challa, Ram A. N. R., Huffman, Anthony, He, Yongqun, Beverley, John
Infectious diseases remain a critical global health challenge, and the integration of standardized ontologies plays a vital role in managing related data. The Infectious Disease Ontology (IDO) and its extensions, such as the Coronavirus Infectious Disease Ontology (CIDO), are essential for organizing and disseminating information related to infectious diseases. The COVID-19 pandemic highlighted the need for updating IDO and its virus-specific extensions. There is an additional need to update IDO extensions specific to bacteria, fungus, and parasite infectious diseases. We adopt the "hub and spoke" methodology to generate pathogen-specific extensions of IDO: Virus Infectious Disease Ontology (VIDO), Bacteria Infectious Disease Ontology (BIDO), Mycosis Infectious Disease Ontology (MIDO), and Parasite Infectious Disease Ontology (PIDO). The creation of pathogen-specific reference ontologies advances modularization and reusability of infectious disease data within the IDO ecosystem. Future work will focus on further refining these ontologies, creating new extensions, and developing application ontologies based on them, in line with ongoing efforts to standardize biological and biomedical terminologies for improved data sharing and analysis.
Credentials in the Occupation Ontology
Beverley, John, McGill, Robin, Smith, Sam, Zheng, Jie, De Colle, Giacomo, Wilson, Finn, Diller, Matthew, Duncan, William D., Hogan, William R., He, Yongqun
The term credential encompasses educational certificates, degrees, certifications, and government-issued licenses. An occupational credential is a verification of an individuals qualification or competence issued by a third party with relevant authority. Job seekers often leverage such credentials as evidence that desired qualifications are satisfied by their holders. Many U.S. education and workforce development organizations have recognized the importance of credentials for employment and the challenges of understanding the value of credentials. In this study, we identified and ontologically defined credential and credential-related terms at the textual and semantic levels based on the Occupation Ontology (OccO), a BFO-based ontology. Different credential types and their authorization logic are modeled. We additionally defined a high-level hierarchy of credential related terms and relations among many terms, which were initiated in concert with the Alabama Talent Triad (ATT) program, which aims to connect learners, earners, employers and education/training providers through credentials and skills. To our knowledge, our research provides for the first time systematic ontological modeling of the important domain of credentials and related contents, supporting enhanced credential data and knowledge integration in the future.
Grounding Realizable Entities
Rabenberg, Michael, Benson, Carter, Donato, Federico, He, Yongqun, Huffman, Anthony, Babcock, Shane, Beverley, John
Ontological representations of qualities, dispositions, and roles have been refined over the past decade, clarifying subtle distinctions in life science research. After articulating a widely-used characterization of these entities within the context of Basic Formal Ontology (BFO), we identify gaps in this treatment and motivate the need for supplementing the BFO characterization. By way of supplement, we propose definitions for grounding relations holding between qualities and dispositions, and dispositions and roles, illustrating our proposal by representing subtle aspects of host-pathogen interactions.
Evaluation of GPT and BERT-based models on identifying protein-protein interactions in biomedical text
Rehana, Hasin, Çam, Nur Bengisu, Basmaci, Mert, Zheng, Jie, Jemiyo, Christianah, He, Yongqun, Özgür, Arzucan, Hur, Junguk
Detecting protein-protein interactions (PPIs) is crucial for understanding genetic mechanisms, disease pathogenesis, and drug design. However, with the fast-paced growth of biomedical literature, there is a growing need for automated and accurate extraction of PPIs to facilitate scientific knowledge discovery. Pre-trained language models, such as generative pre-trained transformers (GPT) and bidirectional encoder representations from transformers (BERT), have shown promising results in natural language processing (NLP) tasks. We evaluated the performance of PPI identification of multiple GPT and BERT models using three manually curated gold-standard corpora: Learning Language in Logic (LLL) with 164 PPIs in 77 sentences, Human Protein Reference Database with 163 PPIs in 145 sentences, and Interaction Extraction Performance Assessment with 335 PPIs in 486 sentences. BERT-based models achieved the best overall performance, with BioBERT achieving the highest recall (91.95%) and F1-score (86.84%) and PubMedBERT achieving the highest precision (85.25%). Interestingly, despite not being explicitly trained for biomedical texts, GPT-4 achieved commendable performance, comparable to the top-performing BERT models. It achieved a precision of 88.37%, a recall of 85.14%, and an F1-score of 86.49% on the LLL dataset. These results suggest that GPT models can effectively detect PPIs from text data, offering promising avenues for application in biomedical literature mining. Further research could explore how these models might be fine-tuned for even more specialized tasks within the biomedical domain.
OntoPlot: A Novel Visualisation for Non-hierarchical Associations in Large Ontologies
Yang, Ying, Wybrow, Michael, Li, Yuan-Fang, Czauderna, Tobias, He, Yongqun
Ontologies are formal representations of concepts and complex relationships among them. They have been widely used to capture comprehensive domain knowledge in areas such as biology and medicine, where large and complex ontologies can contain hundreds of thousands of concepts. Especially due to the large size of ontologies, visualisation is useful for authoring, exploring and understanding their underlying data. Existing ontology visualisation tools generally focus on the hierarchical structure, giving much less emphasis to non-hierarchical associations. In this paper we present OntoPlot, a novel visualisation specifically designed to facilitate the exploration of all concept associations whilst still showing an ontology's large hierarchical structure. This hybrid visualisation combines icicle plots, visual compression techniques and interactivity, improving space-efficiency and reducing visual structural complexity. We conducted a user study with domain experts to evaluate the usability of OntoPlot, comparing it with the de facto ontology editor Prot{\'e}g{\'e}. The results confirm that OntoPlot attains our design goals for association-related tasks and is strongly favoured by domain experts.