Ontologies
Infusing clinical knowledge into tokenisers for language models
Hasan, Abul, Wu, Jinge, Nguyen, Quang Ngoc, Andres, Salomé, Guellil, Imane, Zhang, Huayu, Casey, Arlene, Alex, Beatrice, Guthrie, Bruce, Wu, Honghan
This study introduces a novel knowledge-enhanced tokenisation mechanism, K-Tokeniser, for clinical text processing. Technically, at the initialisation stage, K-Tokeniser populates global representations of tokens based on semantic types of domain concepts (such as drugs or diseases) from either a domain ontology like the Unified Medical Language System or the training data of the task-related corpus. At the training or inference stage, sentence-level localised context is used to choose the optimal global token representation and so realise the semantic-based tokenisation. To avoid pretraining with the new tokeniser, an embedding initialisation approach is proposed to generate representations for new tokens. Using three transformer-based language models, a comprehensive set of experiments is conducted on four real-world datasets to evaluate K-Tokeniser in a wide range of clinical text analytics tasks, including clinical concept and relation extraction, automated clinical coding, clinical phenotype identification, and clinical research article classification. Overall, our models demonstrate consistent improvements over their counterparts in all tasks. In particular, substantial improvements are observed in the automated clinical coding task, with a 13\% increase in Micro $F_1$ score. Furthermore, K-Tokeniser also shows significant capacity for facilitating faster convergence of language models. Specifically, using K-Tokeniser, the language models require only 50\% of the training data to achieve the best performance of the baseline tokeniser using all training data in the concept extraction task, and less than 20\% of the data for the automated coding task. It is worth mentioning that all these improvements require no pre-training process, making the approach generalisable.
BTS: Building Timeseries Dataset: Empowering Large-Scale Building Analytics
Prabowo, Arian, Lin, Xiachong, Razzak, Imran, Xue, Hao, Yap, Emily W., Amos, Matthew, Salim, Flora D.
Buildings play a crucial role in human well-being, influencing occupant comfort, health, and safety. Additionally, they contribute significantly to global energy consumption, accounting for one-third of total energy usage, and carbon emissions. Optimizing building performance presents a vital opportunity to combat climate change and promote human flourishing. However, research in building analytics has been hampered by the lack of accessible, available, and comprehensive real-world datasets on multiple building operations. In this paper, we introduce the Building TimeSeries (BTS) dataset. Our dataset covers three buildings over a three-year period, comprising more than ten thousand timeseries data points with hundreds of unique ontologies. Moreover, the metadata is standardized using the Brick schema. To demonstrate the utility of this dataset, we performed benchmarks on two tasks: timeseries ontology classification and zero-shot forecasting. These tasks represent an essential initial step in addressing challenges related to interoperability in building analytics.
A Step Towards a Universal Method for Modeling and Implementing Cross-Organizational Business Processes
Zeisler, Gerhard, Braunauer, Tim Tobias, Fleischmann, Albert, Singer, Robert
The widely adopted Business Process Model and Notation (BPMN) is a cornerstone of industry standards for business process modeling. However, its ambiguous execution semantics often result in inconsistent interpretations, depending on the software used for implementation. In response, the Process Specification Language (PASS) provides formally defined semantics to overcome these interpretational challenges. Despite its clear advantages, PASS has not reached the same level of industry penetration as BPMN. This feasibility study proposes using PASS as an intermediary framework to translate and execute BPMN models. It describes the development of a prototype translator that converts specific BPMN elements into a format compatible with PASS. These models are then transformed into source code and executed in a bespoke workflow environment, marking a departure from traditional BPMN implementations. Our findings suggest that integrating PASS enhances compatibility across different modeling and execution tools and offers a more robust methodology for implementing business processes across organizations. This study lays the groundwork for more accurate and unified business process model executions, potentially transforming industry standards for process modeling and execution.
Ontology Embedding: A Survey of Methods, Applications and Resources
Chen, Jiaoyan, Mashkova, Olga, Zhapa-Camacho, Fernando, Hoehndorf, Robert, He, Yuan, Horrocks, Ian
Ontologies are widely used for representing domain knowledge and metadata, playing an increasingly important role in Information Systems, the Semantic Web, Bioinformatics and many other domains. However, the logical reasoning that ontologies can directly support is quite limited in learning, approximation and prediction. One straightforward solution is to integrate statistical analysis and machine learning. To this end, automatically learning vector representations for the knowledge of an ontology, i.e., ontology embedding, has been widely investigated in recent years. Numerous papers have been published on ontology embedding, but a lack of systematic reviews hinders researchers from gaining a comprehensive understanding of this field. To bridge this gap, we write this survey paper, which first introduces different kinds of semantics of ontologies, and formally defines ontology embedding from the perspectives of both mathematics and machine learning, as well as its property of faithfulness. Based on this, it systematically categorises and analyses a relatively complete set of over 80 papers, according to the ontologies and semantics that they aim at, and their technical solutions including geometric modeling, sequence modeling and graph propagation. This survey also introduces the applications of ontology embedding in ontology engineering, machine learning augmentation and life sciences, presents a new library mOWL, and discusses the challenges and future directions.
A Document-based Knowledge Discovery with Microservices Architecture
Gidey, Habtom Kahsay, Kesseler, Mario, Stangl, Patrick, Hillmann, Peter, Karcher, Andreas
The first step towards digitalization within organizations lies in digitization - the conversion of analog data into digitally stored data. This basic step is the prerequisite for all following activities like the digitalization of processes or the servitization of products or offerings. However, digitization itself often leads to 'data-rich' but 'knowledge-poor' material. Knowledge discovery and knowledge extraction as approaches try to increase the usefulness of digitized data. In this paper, we point out the key challenges in the context of knowledge discovery and present an approach to addressing these using a microservices architecture. Our solution led to a conceptual design focusing on keyword extraction, similarity calculation of documents, database queries in natural language, and programming language independent provision of the extracted information. In addition, the conceptual design provides referential design guidelines for integrating processes and applications for semi-automatic learning, editing, and visualization of ontologies. The concept also uses a microservices architecture to address non-functional requirements, such as scalability and resilience. The evaluation of the specified requirements is performed using a demonstrator that implements the concept. Furthermore, this modern approach is used in the German patent office in an extended version.
Can Social Ontological Knowledge Representations be Measured Using Machine Learning?
Personal Social Ontology (PSO), it is proposed, is how an individual perceives the ontological properties of terms. For example, an absolute fatalist would arguably use terms that remove any form of agency from a person. Such fatalism has the impact of ontologically defining acts such as winning, victory and success in a manner that is contrary to how a non-fatalist would ontologically define them. While both the said fatalist and non-fatalist would agree on the dictionary definition of these terms, they would differ on specifically how they can be brought about. This difference between the two individuals can be induced from their usage of these terms, i.e., the co-occurrence of these terms with other terms. As such, a quantification of this co-occurrence offers an avenue to characterise the social ontological views of the speaker. In this paper we ask, what specific term co-occurrence should be measured in order to obtain a valid and reliable psychometric measure of a person's social ontology? We consider the social psychology and social neuroscience literature to arrive at a list of social concepts that can be considered principal features of personal social ontology, and then propose an NLP pipeline to capture the articulation of these terms in language.
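To make the idea of quantifying term co-occurrence concrete, the following is a minimal Python sketch, not the pipeline proposed in the paper. The function name `cooccurrence_counts`, the symmetric window size and the example sentence are all assumptions chosen for illustration.

```python
from collections import Counter


def cooccurrence_counts(tokens, targets, window=2):
    """For each target term, count the terms that appear within
    `window` positions of it (illustrative helper, not from the paper)."""
    counts = {t: Counter() for t in targets}
    for i, tok in enumerate(tokens):
        if tok in counts:
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # skip the target occurrence itself
                    counts[tok][tokens[j]] += 1
    return counts


# Hypothetical example text: which terms surround "success"?
text = "success comes from luck while success comes from effort"
counts = cooccurrence_counts(text.split(), {"success"}, window=3)
print(counts["success"].most_common())
```

Comparing such count profiles between speakers (for instance, how often "success" appears near "luck" versus "effort") is one simple way the difference described above could be turned into a number.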
Toward a Method to Generate Capability Ontologies from Natural Language Descriptions
da Silva, Luis Miguel Vieira, Köcher, Aljosha, Gehlhoff, Felix, Fay, Alexander
To achieve a flexible and adaptable system, capability ontologies are increasingly leveraged to describe functions in a machine-interpretable way. However, modeling such complex ontological descriptions is still a manual and error-prone task that requires a significant amount of effort and ontology expertise. This contribution presents an innovative method to automate capability ontology modeling using Large Language Models (LLMs), which have proven to be well suited for such tasks. Our approach requires only a natural language description of a capability, which is then automatically inserted into a predefined prompt using a few-shot prompting technique. After prompting an LLM, the resulting capability ontology is automatically verified through various steps in a loop with the LLM to check the overall correctness of the capability ontology. First, a syntax check is performed, then a check for contradictions, and finally a check for hallucinations and missing ontology elements. Our method greatly reduces manual effort, as only the initial natural language description and a final human review and possible correction are necessary, thereby streamlining the capability ontology generation process.
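The generate-then-verify loop described in this abstract can be sketched in Python as follows. This is only an illustration under stated assumptions: `generate_with_verification`, the three toy checks, the `ALLOWED` vocabulary and the `fake_llm` stand-in are hypothetical, and a real system would call an actual LLM, an ontology parser and a reasoner instead.

```python
from typing import Callable, List, Optional, Tuple

# A check returns an error message to feed back, or None if it passes.
Check = Callable[[str], Optional[str]]


def generate_with_verification(
    generate: Callable[[Optional[str]], str],
    checks: List[Check],
    max_rounds: int = 3,
) -> Tuple[str, int]:
    """Ask the generator for a candidate, run the checks in order, and
    re-prompt with the first failure message until all checks pass."""
    feedback: Optional[str] = None
    for round_no in range(1, max_rounds + 1):
        candidate = generate(feedback)
        feedback = None
        for check in checks:
            feedback = check(candidate)
            if feedback is not None:
                break  # stop at the first failing check
        if feedback is None:
            return candidate, round_no
    raise RuntimeError(f"no valid candidate after {max_rounds} rounds")


# Toy stand-ins (assumptions): a statement must end with " ." and may
# only use terms from a known vocabulary.
ALLOWED = {"Drilling", "requires", "Drill"}


def syntax_check(c: str) -> Optional[str]:
    return None if c.endswith(" .") else "missing terminating ' .'"


def contradiction_check(c: str) -> Optional[str]:
    return None  # placeholder: a real check would invoke a reasoner


def hallucination_check(c: str) -> Optional[str]:
    unknown = [w for w in c.rstrip(" .").split() if w not in ALLOWED]
    return f"unknown terms: {unknown}" if unknown else None


attempts = iter([
    "Drilling requires Drill",      # fails the syntax check
    "Drilling requires Laser .",    # fails the hallucination check
    "Drilling requires Drill .",    # passes all checks
])


def fake_llm(feedback: Optional[str]) -> str:
    return next(attempts)  # a real LLM call would use the feedback


result, rounds = generate_with_verification(
    fake_llm, [syntax_check, contradiction_check, hallucination_check]
)
print(result, rounds)
```

Ordering the checks from cheapest to most expensive mirrors the sequence named in the abstract (syntax, then contradictions, then hallucinations and missing elements) and avoids running costly checks on output that is not even well-formed.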
Improving Commonsense Bias Classification by Mitigating the Influence of Demographic Terms
Understanding commonsense knowledge is crucial in the field of Natural Language Processing (NLP). However, the presence of demographic terms in commonsense knowledge poses a potential risk of compromising the performance of NLP models. This study aims to investigate and propose methods for enhancing the performance and effectiveness of a commonsense polarization classifier by mitigating the influence of demographic terms. Three methods are introduced in this paper: (1) hierarchical generalization of demographic terms, (2) threshold-based augmentation, and (3) integration of the hierarchical generalization and threshold-based augmentation methods (IHTA). The first method involves replacing demographic terms with more general ones based on a term hierarchy ontology, aiming to mitigate the influence of specific terms. To address the limited bias-related information, the second method measures the polarization of demographic terms by comparing the changes in the model's predictions when these terms are masked versus unmasked. This method augments commonsense sentences containing terms with high polarization values by replacing their predicates with synonyms generated by ChatGPT. The third method combines the two approaches, starting with threshold-based augmentation followed by hierarchical generalization. The experiments show that the first method increases the accuracy over the baseline by 2.33%, and the second one by 0.96% over standard augmentation methods. The IHTA technique yielded 8.82% and 9.96% higher accuracy than the threshold-based and standard augmentation methods, respectively.
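The masked-versus-unmasked comparison behind the second method can be illustrated with a short Python sketch. The abstract does not give the exact formula, so the absolute difference used here is an assumption, as are the names `polarization` and `toy_predict`, the `[MASK]` token, the scores and the threshold.

```python
def polarization(predict, sentence, term, mask="[MASK]"):
    """Absolute change in the model's score when `term` is masked.
    The absolute-difference form is an assumption, not the paper's formula."""
    return abs(predict(sentence) - predict(sentence.replace(term, mask)))


def toy_predict(sentence):
    """Hypothetical classifier: scores a sentence higher whenever one
    particular demographic term appears in it."""
    return 0.75 if "elderly" in sentence else 0.5


sentence = "elderly people drive slowly"
scores = {t: polarization(toy_predict, sentence, t) for t in ("elderly", "people")}

THRESHOLD = 0.1  # assumed cut-off for "high polarization"
to_augment = [t for t, s in scores.items() if s >= THRESHOLD]
print(scores, to_augment)
```

In this toy setting only "elderly" changes the prediction when masked, so only sentences containing it would be selected for augmentation.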
Mining Frequent Structures in Conceptual Models
Fumagalli, Mattia, Sales, Tiago Prince, Barcelos, Pedro Paulo F., Micale, Giovanni, Zaytsev, Vadim, Calvanese, Diego, Guizzardi, Giancarlo
The problem of using structured methods to represent knowledge is well-known in conceptual modeling and has been studied for many years. It has been proven that adopting modeling patterns represents an effective structural method. Patterns are, indeed, generalizable recurrent structures that can be exploited as solutions to design problems. They aid in understanding and improving the process of creating models. The undeniable value of using patterns in conceptual modeling was demonstrated in several experimental studies. However, discovering patterns in conceptual models is widely recognized as a highly complex task and a systematic solution to pattern identification is currently lacking. In this paper, we propose a general approach to the problem of discovering frequent structures, as they occur in conceptual modeling languages. As proof of concept for our scientific contribution, we provide an implementation of the approach, by focusing on UML class diagrams, in particular OntoUML models. This implementation comprises an exploratory tool, which, through the combination of a frequent subgraph mining algorithm and graph manipulation techniques, can process multiple conceptual models and discover recurrent structures according to multiple criteria. The primary objective is to offer a support facility for language engineers. This can be employed to leverage both good and bad modeling practices, to evolve and maintain the conceptual modeling language, and to promote the reuse of encoded experience in designing better models with the given language.
Data Complexity in Expressive Description Logics With Path Expressions
We investigate the data complexity of the satisfiability problem for the very expressive description logic ZOIQ (a.k.a. $\mathcal{ALCHb}^{\mathsf{Self}}_{\mathsf{reg}}\mathcal{OIQ}$) over quasi-forests and establish its NP-completeness. This completes the data complexity landscape for decidable fragments of ZOIQ, and reproves known results on decidable fragments of OWL2 (SR family). Using the same technique, we establish coNEXPTIME-completeness (w.r.t. the combined complexity) of the entailment problem of rooted queries in ZIQ.