Ontologies
Scaling Knowledge Graphs for Automating AI of Digital Twins
Ploennigs, Joern, Semertzidis, Konstantinos, Lorenzi, Fabio, Mihindukulasooriya, Nandana
Digital Twins are digital representations of systems in the Internet of Things (IoT) that are often based on AI models that are trained on data from those systems. Semantic models are used increasingly to link these datasets from different stages of the IoT systems life-cycle together and to automatically configure the AI modelling pipelines. This combination of semantic models with AI pipelines running on external datasets raises unique challenges particular if rolled out at scale. Within this paper we will discuss the unique requirements of applying semantic graphs to automate Digital Twins in different practical use cases. We will introduce the benchmark dataset DTBM that reflects these characteristics and look into the scaling challenges of different knowledge graph technologies. Based on these insights we will propose a reference architecture that is in-use in multiple products in IBM and derive lessons learned for scaling knowledge graphs for configuring AI models for Digital Twins.
ProVe: A Pipeline for Automated Provenance Verification of Knowledge Graphs against Textual Sources
Amaral, Gabriel, Rodrigues, Odinaldo, Simperl, Elena
A Knowledge Graph (KG) is a type of knowledge base that stores information in the form of semantic triples formed by a subject, a predicate, and an object. KGs represent both real and abstract entities internally as labelled and uniquely identifiable entities, such as The Moon or Happiness, and can amass information from a multitude of domains and sources by connecting such entities amongst themselves or to literals through relationships, coded via uniquely identified predicates. KGs serve as sources of both human and machine-readable semantically structured data for various crucial applications in the modern web landscape, such as Wikipedia infoboxes, search engines results, voice-activated assistants, and information gathering projects [30]. Developed and maintained by ontology experts, data curators, and even anonymous volunteers, KGs have massively grown in size and adoption in the last decade, mainly as secondary sources of information. This means not storing new information, but taking it from authoritative and reliable sources which are explicitly referenced. As such, KGs depend on well-documented and verifiable provenance to ensure they are regarded as trustworthy and usable [56]. Processes to assess and assure the quality of information provenance are thus crucial to KGs, especially measuring and maintaining verifiability, i.e. the degree to which consumers of KG triples can attest these are truly supported by their sources [56]. However, such processes are currently performed mostly manually, which does not scale with size. Manually ensuring high verifiability on vital KGs such as Wikidata and DBpedia is prohibitive due to their sheer size.
Ontology Development is Consensus Creation, Not (Merely) Representation
Neuhaus, Fabian, Hastings, Janna
However, working ontologists are often surprised by how challenging and slow it can be to develop ontologies. Here, with a particular emphasis on the sorts of ontologies that are content-heavy and intended to be shared across a community of users (reference ontologies), we propose that a significant and heretofore under-emphasised contributor of challenges during ontology development is the need to create, or bring about, consensus in the face of disagreement. For this reason reference ontology development cannot be automated, at least within the limitations of existing AI approaches. Further, for the same reason ontologists are required to have specific social-negotiating skills which are currently lacking in most technical curricula.
BioLORD: Learning Ontological Representations from Definitions (for Biomedical Concepts and their Textual Descriptions)
Remy, Franรงois, Demuynck, Kris, Demeester, Thomas
This work introduces BioLORD, a new pre-training strategy for producing meaningful representations for clinical sentences and biomedical concepts. State-of-the-art methodologies operate by maximizing the similarity in representation of names referring to the same concept, and preventing collapse through contrastive learning. However, because biomedical names are not always self-explanatory, it sometimes results in non-semantic representations. BioLORD overcomes this issue by grounding its concept representations using definitions, as well as short descriptions derived from a multi-relational knowledge graph consisting of biomedical ontologies. Thanks to this grounding, our model produces more semantic concept representations that match more closely the hierarchical structure of ontologies. BioLORD establishes a new state of the art for text similarity on both clinical sentences (MedSTS) and biomedical concepts (MayoSRS).
Deep Bidirectional Language-Knowledge Graph Pretraining
Yasunaga, Michihiro, Bosselut, Antoine, Ren, Hongyu, Zhang, Xikun, Manning, Christopher D, Liang, Percy, Leskovec, Jure
Pretraining a language model (LM) on text has been shown to help various downstream NLP tasks. Recent works show that a knowledge graph (KG) can complement text data, offering structured background knowledge that provides a useful scaffold for reasoning. However, these works are not pretrained to learn a deep fusion of the two modalities at scale, limiting the potential to acquire fully joint representations of text and KG. Here we propose DRAGON (Deep Bidirectional Language-Knowledge Graph Pretraining), a self-supervised approach to pretraining a deeply joint language-knowledge foundation model from text and KG at scale. Specifically, our model takes pairs of text segments and relevant KG subgraphs as input and bidirectionally fuses information from both modalities. We pretrain this model by unifying two self-supervised reasoning tasks, masked language modeling and KG link prediction. DRAGON outperforms existing LM and LM+KG models on diverse downstream tasks including question answering across general and biomedical domains, with +5% absolute gain on average. In particular, DRAGON achieves notable performance on complex reasoning about language and knowledge (+10% on questions involving long contexts or multi-step reasoning) and low-resource QA (+8% on OBQA and RiddleSense), and new state-of-the-art results on various BioNLP tasks. Our code and trained models are available at https://github.com/michiyasunaga/dragon.
M2D2: A Massively Multi-domain Language Modeling Dataset
Reid, Machel, Zhong, Victor, Gururangan, Suchin, Zettlemoyer, Luke
We present M2D2, a fine-grained, massively multi-domain corpus for studying domain adaptation in language models (LMs). M2D2 consists of 8.5B tokens and spans 145 domains extracted from Wikipedia and Semantic Scholar. Using ontologies derived from Wikipedia and ArXiv categories, we organize the domains in each data source into 22 groups. This two-level hierarchy enables the study of relationships between domains and their effects on in- and out-of-domain performance after adaptation. We also present a number of insights into the nature of effective domain adaptation in LMs, as examples of the new types of studies M2D2 enables. To improve in-domain performance, we show the benefits of adapting the LM along a domain hierarchy; adapting to smaller amounts of fine-grained domain-specific data can lead to larger in-domain performance gains than larger amounts of weakly relevant data. We further demonstrate a trade-off between in-domain specialization and out-of-domain generalization within and across ontologies, as well as a strong correlation between out-of-domain performance and lexical overlap between domains.
On the Explainability of Natural Language Processing Deep Models
Zini, Julia El, Awad, Mariette
While there has been a recent explosion of work on ExplainableAI ExAI on deep models that operate on imagery and tabular data, textual datasets present new challenges to the ExAI community. Such challenges can be attributed to the lack of input structure in textual data, the use of word embeddings that add to the opacity of the models and the difficulty of the visualization of the inner workings of deep models when they are trained on textual data. Lately, methods have been developed to address the aforementioned challenges and present satisfactory explanations on Natural Language Processing (NLP) models. However, such methods are yet to be studied in a comprehensive framework where common challenges are properly stated and rigorous evaluation practices and metrics are proposed. Motivated to democratize ExAI methods in the NLP field, we present in this work a survey that studies model-agnostic as well as model-specific explainability methods on NLP models. Such methods can either develop inherently interpretable NLP models or operate on pre-trained models in a post-hoc manner. We make this distinction and we further decompose the methods into three categories according to what they explain: (1) word embeddings (input-level), (2) inner workings of NLP models (processing-level) and (3) models' decisions (output-level). We also detail the different evaluation approaches interpretability methods in the NLP field. Finally, we present a case-study on the well-known neural machine translation in an appendix and we propose promising future research directions for ExAI in the NLP field.
Applying FrameNet to Chinese(Poetry)
FrameNet( Fillmore and Baker [2009] ) is well-known for its wide use for knowledge representation in the form of inheritance-based ontologies and lexica( Trott et al. [2020] ). Although FrameNet is usually applied to languages like English, Spanish and Italian, there are still plenty of FrameNet data sets available for other languages like Chinese, which differs significantly from those languages based on Latin alphabets. In this paper, the translation from ancient Chinese Poetry to modern Chinese will be first conducted to further apply the Chinese FrameNet(CFN, provided by Shanxi University). Afterwards, the translation from modern Chinese will be conducted as well for the comparison between the applications of CFN and English FrameNet. Finally, the overall comparison will be draw between CFN to modern Chinese and English FrameNet.
Diversity-aware social robots meet people: beyond context-aware embodied AI
Recchiuto, Carmine, Sgorbissa, Antonio
Carmine Recchiuto, Antonio Sgorbissa Introduction Mayra is a 34-year-old woman from Sri Lanka who arrived in Genoa in 2020, just before the COVID-19 pandemic. Mayra spends her days taking care of her three children and doing housework. Due to the lockdown measures, she had few opportunities to develop relationships with Italian people, so her Italian has remained very basic. The situation did not improve until December 2021 because finding a job was challenging due to the remaining COVID restriction. In January 2022, her husband bought a small robot, and Mayra called it "Dhvija."
Measuring Network Resilience via Geospatial Knowledge Graph: a Case Study of the US Multi-Commodity Flow Network
Rao, Jinmeng, Gao, Song, Miller, Michelle, Morales, Alfonso
Quantifying the resilience in the food system is important for food security issues. In this work, we present a geospatial knowledge graph (GeoKG)-based method for measuring the resilience of a multi-commodity flow network. Specifically, we develop a CFS-GeoKG ontology to describe geospatial semantics of a multi-commodity flow network comprehensively, and design resilience metrics that measure the node-level and network-level dependence of single-sourcing, distant, or non-adjacent suppliers/customers in food supply chains. We conduct a case study of the US state-level agricultural multi-commodity flow network with hierarchical commodity types. The results indicate that, by leveraging GeoKG, our method supports measuring both node-level and network-level resilience across space and over time and also helps discover concentration patterns of agricultural resources in the spatial network at different geographic scales.