Ontologies
Harmonizing Metadata of Language Resources for Enhanced Querying and Accessibility
This paper addresses the harmonization of metadata from diverse repositories of language resources (LRs). Leveraging linked data and RDF techniques, we integrate data from multiple sources into a unified model based on the DCAT and META-SHARE OWL ontologies. Our methodology supports text-based search, faceted browsing, and advanced SPARQL queries through Linghub, a newly developed portal. Real user queries from the Corpora Mailing List (CML) were evaluated to assess Linghub's capability to satisfy actual user needs. Results indicate that while some limitations persist, many user requests can be successfully addressed. The study highlights significant metadata issues and advocates adherence to open vocabularies and standards to enhance metadata harmonization. This initial research underscores the importance of API-based access to LRs, which promotes machine usability and the extraction of data subsets for specific purposes, paving the way for more efficient and standardized LR utilization.
Towards an Ontology of Traceable Impact Management in the Food Supply Chain
Gajderowicz, Bart, Fox, Mark S, Gao, Yongchao
The pursuit of quality improvements and accountability in food supply chains, especially as they relate to food-related outcomes such as hunger, has become increasingly vital, necessitating a comprehensive approach that encompasses product quality and its impact on various stakeholders and their communities. Such an approach offers numerous benefits in increasing product quality and eliminating superfluous measurements while appraising and alleviating the broader societal and environmental repercussions. A traceable impact management model (TIMM) provides an impact structure and a reporting mechanism that identifies each stakeholder's role in the total impact of food production and consumption stages. The model aims to increase traceability's utility in understanding the impact of changes on communities affected by food production and consumption, aligning with current and future government requirements, and addressing the needs of communities and consumers. This holistic approach is further supported by an ontological model that forms the logical foundation and a unified terminology. By proposing a holistic and integrated solution across multiple stakeholders, the model emphasizes quality and the extensive impact of championing accountability, sustainability, and responsible practices with global traceability. With these combined efforts, the food supply chain moves toward a global tracking and tracing process that not only ensures product quality but also addresses its impact on a broader scale, fostering accountability, sustainability, and responsible food production and consumption.
GoBERT: Gene Ontology Graph Informed BERT for Universal Gene Function Prediction
Miao, Yuwei, Guo, Yuzhi, Ma, Hehuan, Yan, Jingquan, Jiang, Feng, Liao, Rui, Huang, Junzhou
Exploring the functions of genes and gene products is crucial to a wide range of fields, including medical research, evolutionary biology, and environmental science. However, discovering new functions largely relies on expensive and exhaustive wet lab experiments. Existing methods of automatic function annotation or prediction mainly focus on protein function prediction with sequence, 3D-structure, or protein family information. In this study, we propose to tackle the gene function prediction problem by exploring the Gene Ontology graph and annotations with BERT (GoBERT) to decipher the underlying relationships among gene functions. Our proposed novel function prediction task utilizes existing functions as inputs and generalizes function prediction to genes and gene products. Specifically, two pre-training tasks are designed to jointly train GoBERT to capture both explicit and implicit relations of functions. Neighborhood prediction is a self-supervised multi-label classification task that captures the explicit function relations. A specified masking and recovering task helps GoBERT find implicit patterns among functions. The pre-trained GoBERT possesses the ability to predict novel functions for various genes and gene products based on known functional annotations. Extensive experiments, biological case studies, and ablation studies are conducted to demonstrate the superiority of our proposed GoBERT.
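As a rough illustration of how a neighborhood prediction task can be framed as multi-label classification, the sketch below builds multi-hot targets from a toy graph: for each term, the target vector marks which other terms are its direct neighbors. The term identifiers, the adjacency dictionary, and the function name `neighbor_targets` are illustrative assumptions and are not taken from GoBERT's actual data pipeline.

```python
from typing import Dict, List, Sequence


def neighbor_targets(
    graph: Dict[str, Sequence[str]], vocab: List[str]
) -> Dict[str, List[int]]:
    """Build one multi-hot target vector per term.

    graph maps a term to its direct neighbors (assumed toy adjacency);
    vocab fixes the order of the output dimensions.
    """
    index = {term: i for i, term in enumerate(vocab)}
    targets: Dict[str, List[int]] = {}
    for term in vocab:
        row = [0] * len(vocab)
        for neighbor in graph.get(term, ()):
            row[index[neighbor]] = 1  # mark each neighbor as a positive label
        targets[term] = row
    return targets


# Hypothetical three-term graph; real Gene Ontology identifiers are not used here.
toy_vocab = ["t0", "t1", "t2"]
toy_graph = {"t0": ["t1", "t2"], "t1": ["t0"]}
toy_targets = neighbor_targets(toy_graph, toy_vocab)
```

A classifier trained against such targets would predict, for an input term, every neighbor at once, which is what makes the task multi-label rather than multi-class.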
Efficient Relational Context Perception for Knowledge Graph Completion
Tu, Wenkai, Wan, Guojia, Shang, Zhengchun, Du, Bo
Knowledge Graphs (KGs) provide a structured representation of knowledge but often suffer from incompleteness. To address this, link prediction or knowledge graph completion (KGC) aims to infer missing facts from the existing facts in a KG. Previous knowledge graph embedding models are limited in their ability to capture expressive features, especially when compared to deeper, multi-layer models. These approaches also assign a single static embedding to each entity and relation, disregarding the fact that entities and relations can behave differently in varying graph contexts. Because the context around a fact triple in a KG is complex, existing methods rely on complex non-linear context encoders, such as transformers, to project entities and relations into low-dimensional representations, which incurs a high computational cost. To overcome these limitations, we propose the Triple Receptance Perception (TRP) architecture to model sequential information, enabling the learning of dynamic contexts of entities and relations. We then use tensor decomposition to calculate triple scores, providing robust relational decoding capabilities. This integration allows for more expressive representations. Experiments on benchmark datasets such as YAGO3-10, UMLS, FB15k, and FB13 in link prediction and triple classification tasks demonstrate that our method performs better than several state-of-the-art models, which supports the effectiveness of the integration.
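To make the idea of scoring a triple by tensor decomposition concrete, here is a minimal sketch that uses a DistMult-style trilinear product. The abstract does not say which decomposition the authors use, so the choice of DistMult, the function names, and the toy embeddings are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def distmult_score(
    head: Sequence[float], relation: Sequence[float], tail: Sequence[float]
) -> float:
    """Trilinear product sum_i h_i * r_i * t_i (DistMult-style, assumed here)."""
    if not (len(head) == len(relation) == len(tail)):
        raise ValueError("embeddings must share one dimension")
    return sum(h * r * t for h, r, t in zip(head, relation, tail))


def rank_tails(
    head: Sequence[float],
    relation: Sequence[float],
    candidates: Dict[str, Sequence[float]],
) -> List[Tuple[str, float]]:
    """Score every candidate tail entity and sort by descending score."""
    scored = [
        (name, distmult_score(head, relation, vec))
        for name, vec in candidates.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy two-dimensional embeddings; the values are made up for the example.
ranking = rank_tails(
    head=[1.0, 0.0],
    relation=[1.0, 1.0],
    candidates={"a": [0.2, 0.9], "b": [0.8, 0.1]},
)
```

In link prediction, the highest-ranked candidate is taken as the predicted missing entity; in triple classification, the score is compared against a threshold instead.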
A Fourfold Pathogen Reference Ontology Suite
Babcock, Shane, Benson, Carter, De Colle, Giacomo, Cohen, Sydney, Diehl, Alexander D., Challa, Ram A. N. R., Huffman, Anthony, He, Yongqun, Beverley, John
Infectious diseases remain a critical global health challenge, and the integration of standardized ontologies plays a vital role in managing related data. The Infectious Disease Ontology (IDO) and its extensions, such as the Coronavirus Infectious Disease Ontology (CIDO), are essential for organizing and disseminating information related to infectious diseases. The COVID-19 pandemic highlighted the need for updating IDO and its virus-specific extensions. There is an additional need to update IDO extensions specific to bacteria, fungus, and parasite infectious diseases. We adopt the "hub and spoke" methodology to generate pathogen-specific extensions of IDO: Virus Infectious Disease Ontology (VIDO), Bacteria Infectious Disease Ontology (BIDO), Mycosis Infectious Disease Ontology (MIDO), and Parasite Infectious Disease Ontology (PIDO). The creation of pathogen-specific reference ontologies advances modularization and reusability of infectious disease data within the IDO ecosystem. Future work will focus on further refining these ontologies, creating new extensions, and developing application ontologies based on them, in line with ongoing efforts to standardize biological and biomedical terminologies for improved data sharing and analysis.
Ontology-grounded Automatic Knowledge Graph Construction by LLM under Wikidata schema
Feng, Xiaohan, Wu, Xixin, Meng, Helen
We propose an ontology-grounded approach to Knowledge Graph (KG) construction using Large Language Models (LLMs) on a knowledge base. An ontology is authored by generating Competency Questions (CQs) on the knowledge base to discover the knowledge scope, extracting relations from the CQs, and attempting to replace equivalent relations with their counterparts in Wikidata. To ensure consistency and interpretability in the resulting KG, we ground the generation of the KG in the authored ontology based on the extracted relations. Evaluation on benchmark datasets demonstrates competitive performance on the knowledge graph construction task. Our work presents a promising direction for a scalable KG construction pipeline with minimal human intervention that yields high-quality, human-interpretable KGs, which are interoperable with Wikidata semantics for potential knowledge base expansion.
RAGONITE: Iterative Retrieval on Induced Databases and Verbalized RDF for Conversational QA over KGs with RAG
Roy, Rishiraj Saha, Hinze, Chris, Schlotthauer, Joel, Naderi, Farzad, Hangya, Viktor, Foltyn, Andreas, Hahn, Luzian, Kuech, Fabian
Conversational question answering (ConvQA) is a convenient means of searching over RDF knowledge graphs (KGs), where a prevalent approach is to translate natural language questions to SPARQL queries. However, SPARQL has certain shortcomings: (i) it is brittle for complex intents and conversational questions, and (ii) it is not suitable for more abstract needs. Instead, we propose a novel two-pronged system where we fuse: (i) SQL-query results over a database automatically derived from the KG, and (ii) text-search results over verbalizations of KG facts. Our pipeline supports iterative retrieval: when the results of any branch are found to be unsatisfactory, the system can automatically opt for further rounds. We put everything together in a retrieval augmented generation (RAG) setup, where an LLM generates a coherent response from accumulated search results. We demonstrate the superiority of our proposed system over several baselines on a knowledge graph of BMW automobiles.
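The iterative retrieval idea, where each branch is retried when its results are judged unsatisfactory, can be sketched as a small control loop. This is only an outline under stated assumptions: the `Retriever` signature, the `is_satisfactory` check, the round limit, and the policy of discarding unsatisfactory rounds are all hypothetical, and the final LLM generation step is left out.

```python
from typing import Callable, Dict, List

# A retriever takes (question, round number) and returns evidence snippets.
Retriever = Callable[[str, int], List[str]]


def iterative_retrieve(
    question: str,
    branches: Dict[str, Retriever],
    is_satisfactory: Callable[[List[str]], bool],
    max_rounds: int = 3,
) -> List[str]:
    """Query each branch, retrying up to max_rounds until its results pass the check."""
    evidence: List[str] = []
    for name, retrieve in branches.items():
        for round_no in range(1, max_rounds + 1):
            results = retrieve(question, round_no)
            if is_satisfactory(results):
                # Tag each snippet with its branch so a generator could cite the source.
                evidence.extend(f"[{name}] {snippet}" for snippet in results)
                break
    return evidence


# Stub branches standing in for SQL search and text search (made-up outputs).
def sql_stub(question: str, round_no: int) -> List[str]:
    return [] if round_no == 1 else ["row-1"]


def text_stub(question: str, round_no: int) -> List[str]:
    return ["fact-1"]


collected = iterative_retrieve(
    "example question",
    {"sql": sql_stub, "text": text_stub},
    is_satisfactory=lambda results: len(results) > 0,
)
```

The accumulated, branch-tagged evidence would then be passed to the generator as context.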
On the Power and Limitations of Examples for Description Logic Concepts
Cate, Balder ten, Koudijs, Raoul, Ozaki, Ana
We investigate the power of labeled examples for describing description-logic concepts. Specifically, we systematically study the existence and efficient computability of finite characterisations, i.e., finite sets of labeled examples that uniquely characterize a single concept, for a wide variety of description logics between EL and ALCQI, both without an ontology and in the presence of a DL-Lite ontology. Finite characterisations are relevant for debugging purposes, and their existence is a necessary condition for exact learnability.
soltera2 is a positive example for C, and px10 and teslaY are negative examples for C. In fact, as it turns out, C is the only EL-concept (up to equivalence) that fits these three labeled examples. In other words, these three labeled examples "uniquely characterize" C within the class of all EL-concepts. This shows that the above three labeled examples are a good choice of examples. Adding any additional examples would be redundant. Note, however, that this depends on the choice of description logic. For instance, the richer concept language ALC allows for other concept expressions such as Bicycle Contains.Basket that also fit.
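What it means for an EL-concept to fit a set of labeled examples can be sketched with a small checker. The encoding below (a concept as a pair of concept names and existential restrictions, an interpretation as label and edge dictionaries) and the toy elements `x1`, `x2`, `x3`, `y` are assumptions chosen for illustration; they do not reproduce the paper's example data.

```python
from typing import Dict, Iterable, Set, Tuple

# An EL-concept is (concept names, existential restrictions), where each
# restriction is (role, sub-concept). Example: A AND EXISTS r.B.
Concept = Tuple[Tuple[str, ...], Tuple[Tuple[str, "Concept"], ...]]


def satisfies(
    elem: str,
    concept: Concept,
    labels: Dict[str, Set[str]],
    edges: Dict[Tuple[str, str], Set[str]],
) -> bool:
    """Check whether elem is an instance of the EL-concept in the interpretation."""
    names, restrictions = concept
    if not set(names) <= labels.get(elem, set()):
        return False
    # Every existential restriction needs at least one matching successor.
    return all(
        any(satisfies(succ, sub, labels, edges) for succ in edges.get((elem, role), ()))
        for role, sub in restrictions
    )


def fits(
    concept: Concept,
    positives: Iterable[str],
    negatives: Iterable[str],
    labels: Dict[str, Set[str]],
    edges: Dict[Tuple[str, str], Set[str]],
) -> bool:
    """A concept fits if it holds at all positive and at no negative examples."""
    return all(satisfies(e, concept, labels, edges) for e in positives) and not any(
        satisfies(e, concept, labels, edges) for e in negatives
    )


# Hypothetical interpretation: x1 has label A and an r-successor labeled B.
labels = {"x1": {"A"}, "x2": {"A"}, "x3": {"B"}, "y": {"B"}}
edges = {("x1", "r"): {"y"}, ("x3", "r"): {"y"}}
a_and_exists_r_b: Concept = (("A",), (("r", (("B",), ())),))
just_a: Concept = (("A",), ())
```

Here `a_and_exists_r_b` fits the examples with `x1` positive and `x2`, `x3` negative, while `just_a` does not, because it also holds at the negative example `x2`. A finite characterisation is a set of examples for which exactly one concept (up to equivalence) passes this check.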
Advances in Machine Learning Research Using Knowledge Graphs
Machine learning is an interdisciplinary field that studies how computers can learn and simulate human learning behaviour. By acquiring new knowledge, machine learning aims to reorganize existing knowledge structures to continuously improve its own performance. Machine learning was proposed in the mid-1950s, and over the next 30 years, related research in the field of machine learning continued to develop. Machine learning has interdisciplinary attributes and has been widely applied in the field of artificial intelligence. Zhang and Wang [2016] argue that the way to transform big data into more valuable knowledge is by applying machine learning techniques.
Apples to Apples: Establishing Comparability in Knowledge Generation Tasks Involving Users
Debruyne, Christophe, Junior, Ademar Crotti
Knowledge graph construction (KGC) from (semi-)structured data is challenging, and facilitating user involvement is an issue frequently brought up within this community. We cannot deny the progress we have made with respect to (declarative) knowledge generation languages and tools to help build such mappings. However, it is surprising that no two studies report on similar protocols. This heterogeneity does not allow for a comparison of KGC languages, techniques, and tools. This paper first analyses the various reports on studies involving users to identify the points of comparison and the gaps between them. These gaps include a lack of systematic consistency in task design, participant selection, and evaluation metrics. Moreover, there needs to be a systematic way of analyzing the data and reporting the findings, which is also lacking. We thus propose a user protocol for KGC designed to address this challenge. Where possible, we draw elements from the literature that we deem fit for such a protocol. The protocol, as such, allows for the comparison of languages and techniques for the core functionality of the RDF Mapping Language (RML), which is covered by most of the other state-of-the-art techniques and tools. We also propose how the protocol can be amended to compare extensions (of RML). This protocol provides an important step towards a more comparable evaluation of KGC user studies.