Ontologies
Adapter-based Approaches to Knowledge-enhanced Language Models -- A Survey
Fichtl, Alexander, Vladika, Juraj, Groh, Georg
Knowledge-enhanced language models (KELMs) have emerged as promising tools to bridge the gap between large-scale language models and domain-specific knowledge. KELMs can achieve higher factual accuracy and mitigate hallucinations by leveraging knowledge graphs (KGs). They are frequently combined with adapter modules to reduce the computational load and risk of catastrophic forgetting. In this paper, we conduct a systematic literature review (SLR) on adapter-based approaches to KELMs. We provide a structured overview of existing methodologies in the field through quantitative and qualitative analysis and explore the strengths and potential shortcomings of individual approaches. We show that general knowledge and domain-specific approaches have been frequently explored along with various adapter architectures and downstream tasks. We particularly focused on the popular biomedical domain, where we provided an insightful performance comparison of existing KELMs. We outline the main trends and propose promising future directions.
F -- A Model of Events based on the Foundational Ontology DOLCE+DnS Ultralite
Scherp, Ansgar, Franz, Thomas, Saathoff, Carsten, Staab, Steffen
The lack of a formal model of events hinders interoperability in distributed event-based systems. In this paper, we present a formal model of events, called Event-Model-F. The model is based on the foundational ontology DOLCE+DnS Ultralite (DUL) and provides comprehensive support to represent time and space, objects and persons, as well as mereological, causal, and correlative relationships between events. In addition, the Event-Model-F provides a flexible means for event composition, modeling event causality and event correlation, and representing different interpretations of the same event. The Event-Model-F is developed following the pattern-oriented approach of DUL, is modularized in different ontologies, and can be easily extended by domain specific ontologies.
OM4OV: Leveraging Ontology Matching for Ontology Versioning
Qiang, Zhangcheng, Taylor, Kerry, Wang, Weiqing
Due to the dynamic nature of the semantic web, ontology version control is required to capture time-varying information, most importantly for widely-used ontologies. Despite the long-standing recognition of ontology versioning (OV) as a crucial component for efficient ontology management, the growing size of ontologies and accumulating errors caused by manual labour overwhelm current OV approaches. In this paper, we propose yet another approach to performing OV using existing ontology matching (OM) techniques and systems. We introduce a unified OM4OV pipeline. From an OM perspective, we reconstruct a new task formulation, measurement, and testbed for OV tasks. Reusing the prior alignment(s) from OM, we propose a pipeline optimisation method called cross-reference (CR) mechanism to improve overall OV performance. We experimentally validate the OM4OV pipeline and the cross-reference mechanism in modified Ontology Alignment Evaluation Initiative (OAEI) datasets. We also discuss the insights on OM used for OV tasks, where some false mappings detected by OV systems are not actually false.
How Does A Text Preprocessing Pipeline Affect Ontology Syntactic Matching?
Qiang, Zhangcheng, Taylor, Kerry, Wang, Weiqing
The generic text preprocessing pipeline, comprising Tokenisation, Normalisation, Stop Words Removal, and Stemming/Lemmatisation, has been implemented in many ontology matching (OM) systems. However, the lack of standardisation in text preprocessing creates diversity in mapping results. In this paper, we investigate the effect of the text preprocessing pipeline on OM tasks at syntactic levels. Our experiments on 8 Ontology Alignment Evaluation Initiative (OAEI) track repositories with 49 distinct alignments indicate: (1) Tokenisation and Normalisation are currently more effective than Stop Words Removal and Stemming/Lemmatisation; and (2) The selection of Lemmatisation and Stemming is task-specific. We recommend standalone Lemmatisation or Stemming with post-hoc corrections. We find that (3) Porter Stemmer and Snowball Stemmer perform better than Lancaster Stemmer; and that (4) Part-of-Speech (POS) Tagging does not help Lemmatisation. To repair less effective Stop Words Removal and Stemming/Lemmatisation used in OM tasks, we propose a novel context-based pipeline repair approach that significantly improves matching correctness and overall matching performance. We also discuss the use of text preprocessing pipeline in the new era of large language models (LLMs).
Class Order Disorder in Wikidata and First Fixes
Patel-Schneider, Peter F., Doğan, Ege Atacan
Wikidata has a large ontology with classes at several orders. The Wikidata ontology has long been known to have violations of class order and information related to class order that appears suspect. SPARQL queries were evaluated against Wikidata to determine the prevalence of several kinds of violations and suspect information and the results analyzed. Some changes were manually made to Wikidata to remove some of these results and the queries rerun, showing the effect of the changes. Suggestions are provided on how the problems uncovered might be addressed, either though better tooling or involvement of the Wikidata community.
IRSKG: Unified Intrusion Response System Knowledge Graph Ontology for Cyber Defense
Panigrahi, Damodar, Mitra, Shaswata, Neupane, Subash, Mittal, Sudip, Blakely, Benjamin A.
Cyberattacks are becoming increasingly difficult to detect and prevent due to their sophistication. In response, Autonomous Intelligent Cyber-defense Agents (AICAs) are emerging as crucial solutions. One prominent AICA agent is the Intrusion Response System (IRS), which is critical for mitigating threats after detection. IRS uses several Tactics, Techniques, and Procedures (TTPs) to mitigate attacks and restore the infrastructure to normal operations. Continuous monitoring of the enterprise infrastructure is an essential TTP the IRS uses. However, each system serves different purposes to meet operational needs. Integrating these disparate sources for continuous monitoring increases pre-processing complexity and limits automation, eventually prolonging critical response time for attackers to exploit. We propose a unified IRS Knowledge Graph ontology (IRSKG) that streamlines the onboarding of new enterprise systems as a source for the AICAs. Our ontology can capture system monitoring logs and supplemental data, such as a rules repository containing the administrator-defined policies to dictate the IRS responses. Besides, our ontology permits us to incorporate dynamic changes to adapt to the evolving cyber-threat landscape. This robust yet concise design allows machine learning models to train effectively and recover a compromised system to its desired state autonomously with explainability.
Ontology-Constrained Generation of Domain-Specific Clinical Summaries
Large Language Models (LLMs) offer promising solutions for text summarization. However, some domains require specific information to be available in the summaries. Generating these domain-adapted summaries is still an open challenge. Similarly, hallucinations in generated content is a major drawback of current approaches, preventing their deployment. This study proposes a novel approach that leverages ontologies to create domain-adapted summaries both structured and unstructured. We employ an ontology-guided constrained decoding process to reduce hallucinations while improving relevance. When applied to the medical domain, our method shows potential in summarizing Electronic Health Records (EHRs) across different specialties, allowing doctors to focus on the most relevant information to their domain. Evaluation on the MIMIC-III dataset demonstrates improvements in generating domain-adapted summaries of clinical notes and hallucination reduction.
Accelerating Knowledge Graph and Ontology Engineering with Large Language Models
Shimizu, Cogan, Hitzler, Pascal
We gratefully acknowledge support from the Simons Foundation and member institutions. Our automated source to PDF conversion system has failed to produce PDF for the paper: 2411.09601 . Return to the abstract for an alternative link to the source, or to find an email address to contact the author. For help regarding the automated source to PDF system, please contact help@arxiv.org
SM3-Text-to-Query: Synthetic Multi-Model Medical Text-to-Query Benchmark
Sivasubramaniam, Sithursan, Osei-Akoto, Cedric, Zhang, Yi, Stockinger, Kurt, Fuerst, Jonathan
Electronic health records (EHRs) are stored in various database systems with different database models on heterogeneous storage architectures, such as relational databases, document stores, or graph databases. These different database models have a big impact on query complexity and performance. While this has been a known fact in database research, its implications for the growing number of Text-to-Query systems have surprisingly not been investigated so far. In this paper, we present SM3-Text-to-Query, the first multi-model medical Text-to-Query benchmark based on synthetic patient data from Synthea, following the SNOMED-CT taxonomy -- a widely used knowledge graph ontology covering medical terminology. SM3-Text-to-Query provides data representations for relational databases (PostgreSQL), document stores (MongoDB), and graph databases (Neo4j and GraphDB (RDF)), allowing the evaluation across four popular query languages, namely SQL, MQL, Cypher, and SPARQL. We systematically and manually develop 408 template questions, which we augment to construct a benchmark of 10K diverse natural language question/query pairs for these four query languages (40K pairs overall). On our dataset, we evaluate several common in-context-learning (ICL) approaches for a set of representative closed and open-source LLMs. Our evaluation sheds light on the trade-offs between database models and query languages for different ICL strategies and LLMs. Last, SM3-Text-to-Query is easily extendable to additional query languages or real, standard-based patient databases.
A taxonomy of explanations to support Explainability-by-Design
Tsakalakis, Niko, Stalla-Bourdillon, Sophie, Huynh, Trung Dong, Moreau, Luc
As automated decision-making solutions are increasingly applied to all aspects of everyday life, capabilities to generate meaningful explanations for a variety of stakeholders (i.e., decision-makers, recipients of decisions, auditors, regulators...) become crucial. In this paper, we present a taxonomy of explanations that was developed as part of a holistic 'Explainability-by-Design' approach for the purposes of the project PLEAD. The taxonomy was built with a view to produce explanations for a wide range of requirements stemming from a variety of regulatory frameworks or policies set at the organizational level either to translate high-level compliance requirements or to meet business needs. The taxonomy comprises nine dimensions. It is used as a stand-alone classifier of explanations conceived as detective controls, in order to aid supportive automated compliance strategies. A machinereadable format of the taxonomy is provided in the form of a light ontology and the benefits of starting the Explainability-by-Design journey with such a taxonomy are demonstrated through a series of examples.