Ontologies
The sameAs Problem: A Survey on Identity Management in the Web of Data
Raad, Joe, Pernelle, Nathalie, Saïs, Fatiha, Beek, Wouter, van Harmelen, Frank
In a decentralised knowledge representation system such as the W eb of Data, it is common and indeed desirable for different knowledge graphs to overlap. Whenever multiple names are used to denote the same thing, owl:sameAs statements are needed in order to link the data and foster reuse. Whilst the deductive value of such identity statements can be extremely useful in enhancing various knowledge-based systems, incorrect use of identity can have wide-ranging effects in a global knowledge space like the W eb of Data. With several works already proven that identity in the W eb is broken, this survey investigates the current state of this "sameAs problem". An open discussion highlights the main weaknesses suffered by solutions in the literature, and draws open challenges to be faced in the future.
Enhancing magic sets with an application to ontological reasoning
Alviano, Mario, Leone, Nicola, Veltri, Pierfrancesco, Zangari, Jessica
Magic sets are a Datalog to Datalog rewriting technique to optimize query answering. The rewritten program focuses on a portion of the stable model(s) of the input program which is sufficient to answer the given query. However, the rewriting may introduce new recursive definitions, which can involve even negation and aggregations, and may slow down program evaluation. This paper enhances the magic set technique by preventing the creation of (new) recursive definitions in the rewritten program. It turns out that the new version of magic sets is closed for Datalog programs with stratified negation and aggregations, which is very convenient to obtain efficient computation of the stable model of the rewritten program. Moreover, the rewritten program is further optimized by the elimination of subsumed rules and by the efficient handling of the cases where binding propagation is lost. The research was stimulated by a challenge on the exploitation of Datalog/\textsc{dlv} for efficient reasoning on large ontologies. All proposed techniques have been hence implemented in the \textsc{dlv} system, and tested for ontological reasoning, confirming their effectiveness. Under consideration for publication in Theory and Practice of Logic Programming.
Linked Crunchbase: A Linked Data API and RDF Data Set About Innovative Companies
Crunchbase is an online platform collecting information about startups and technology companies, including attributes and relations of companies, people, and investments. Data contained in Crunchbase is, to a large extent, not available elsewhere, making Crunchbase to a unique data source. In this paper, we present how to bring Crunchbase to the Web of Data so that its data can be used in the machine-readable RDF format by anyone on the Web. First, we give insights into how we developed and hosted a Linked Data API for Crunchbase and how sameAs links to other data sources are integrated. Then, we present our method for crawling RDF data based on this API to build a custom Crunchbase RDF knowledge graph. We created an RDF data set with over 347 million triples, including 781k people, 659k organizations, and 343k investments. Our Crunchbase Linked Data API is available online at http://linked-crunchbase.org.
CLaRO: a Data-driven CNL for Specifying Competency Questions
Keet, C. Maria, Mahlaza, Zola, Antia, Mary-Jane
Competency Questions (CQs) for an ontology and similar artefacts aim to provide insights into the contents of an ontology and to demarcate its scope. The absence of a controlled natural language, tooling and automation to support the authoring of CQs has hampered their effective use in ontology development and evaluation. The few question templates that exists are based on informal analyses of a small number of CQs and have limited coverage of question types and sentence constructions. We aim to fill this gap by proposing a template-based CNL to author CQs, called CLaRO. For its design, we exploited a new dataset of 234 CQs that had been processed automatically into 106 patterns, which we analysed and used to design a template-based CNL, with an additional CNL model and XML serialisation. The CNL was evaluated with a subset of questions from the original dataset and with two sets of newly sourced CQs. The coverage of CLaRO, with its 93 main templates and 41 linguistic variants, is about 90% for unseen questions. CLaRO has the potential to facilitate streamlining formalising ontology content requirements and, given that about one third of the competency questions in the test sets turned out to be invalid questions, assist in writing good questions.
Making Study Populations Visible through Knowledge Graphs
Chari, Shruthi, Qi, Miao, Agu, Nkcheniyere N., Seneviratne, Oshani, McCusker, James P., Bennett, Kristin P., Das, Amar K., McGuinness, Deborah L.
Treatment recommendations within Clinical Practice Guidelines (CPGs) are largely based on findings from clinical trials and case studies, referred to here as research studies, that are often based on highly selective clinical populations, referred to here as study cohorts. When medical practitioners apply CPG recommendations, they need to understand how well their patient population matches the characteristics of those in the study cohort, and thus are confronted with the challenges of locating the study cohort information and making an analytic comparison. To address these challenges, we develop an ontology-enabled prototype system, which exposes the population descriptions in research studies in a declarative manner, with the ultimate goal of allowing medical practitioners to better understand the applicability and generalizability of treatment recommendations. We build a Study Cohort Ontology (SCO) to encode the vocabulary of study population descriptions, that are often reported in the first table in the published work, thus they are often referred to as Table 1. We leverage the well-used Semanticscience Integrated Ontology (SIO) for defining property associations between classes. Further, we model the key components of Table 1s, i.e., collections of study subjects, subject characteristics, and statistical measures in RDF knowledge graphs. We design scenarios for medical practitioners to perform population analysis, and generate cohort similarity visualizations to determine the applicability of a study population to the clinical population of interest. Our semantic approach to make study populations visible, by standardized representations of Table 1s, allows users to quickly derive clinically relevant inferences about study populations.
Rule Applicability on RDF Triplestore Schemas
Pareti, Paolo, Konstantinidis, George, Norman, Timothy J., Şensoy, Murat
Rule-based systems play a critical role in health and safety, where policies created by experts are usually formalised as rules. When dealing with increasingly large and dynamic sources of data, as in the case of Internet of Things (IoT) applications, it becomes important not only to efficiently apply rules, but also to reason about their applicability on datasets confined by a certain schema. In this paper we define the notion of a triplestore schema which models a set of RDF graphs. Given a set of rules and such a schema as input we propose a method to determine rule applicability and produce output schemas. Output schemas model the graphs that would be obtained by running the rules on the graph models of the input schema. We present two approaches: one based on computing a canonical (critical) instance of the schema, and a novel approach based on query rewriting. We provide theoretical, complexity and evaluation results that show the superior efficiency of our rewriting approach.
Knowledge Graph Embedding for Ecotoxicological Effect Prediction
Myklebust, Erik B., Jimenez-Ruiz, Ernesto, Chen, Jiaoyan, Wolf, Raoul, Tollefsen, Knut Erik
Exploring the effects a chemical compound has on a species takes a considerable experimental effort. Appropriate methods for estimating and suggesting new effects can dramatically reduce the work needed to be done by a laboratory. In this paper we explore the suitability of using a knowledge graph embedding approach for ecotoxicological effect prediction. A knowledge graph has been constructed from publicly available data sets, including a species taxonomy and chemical classification and similarity. The publicly available effect data is integrated to the knowledge graph using ontology alignment techniques. Our experimental results show that the knowledge graph based approach improves the selected baselines.
Formalized Conceptual Spaces with a Geometric Representation of Correlations
Bechberger, Lucas, Kühnberger, Kai-Uwe
The highly influential framework of conceptual spaces provides a geometric way of representing knowledge. Instances are represented by points in a similarity space and concepts are represented by convex regions in this space. After pointing out a problem with the convexity requirement, we propose a formalization of conceptual spaces based on fuzzy star-shaped sets. Our formalization uses a parametric definition of concepts and extends the original framework by adding means to represent correlations between different domains in a geometric way. Moreover, we define various operations for our formalization, both for creating new concepts from old ones and for measuring relations between concepts. We present an illustrative toy-example and sketch a research project on concept formation that is based on both our formalization and its implementation.
Uncovering the Semantics of Wikipedia Categories
Heist, Nicolas, Paulheim, Heiko
Two of the most prominent public knowledge graphs, DBpedia [16] and YAGO [18], build rich taxonomies using Wikipedia's infoboxes and category graph, respectively. They describe more than five million entities and contain multiple hundred millions of triples [27]. When it comes to relation assertions (RAs), however, we observe - even for basic properties - a rather low coverage: More than 50% of the 1.35 million persons in DBpedia have no birthplace assigned; even more than 80% of birthplaces are missing in YAGO. At the same time, type assertions (TAs) are not present as well for many instances - for example, there are about half a million persons in DBpedia not explicitly typed as such [23]. Missing knowledge in Wikipedia-based knowledge graphs can be attributed to absent information in Wikipedia, but also to the extraction procedures of knowledge graphs. DBpedia uses infobox mappings to extract RAs for individual instances, but it does not explicate any information implicitly encoded in categories. YAGO uses manually defined patterns to assign RAs to entities of matching categories. For example, they extract a person's year of birth by
Canonicalizing Knowledge Base Literals
Chen, Jiaoyan, Jimenez-Ruiz, Ernesto, Horrocks, Ian
Ontology-based knowledge bases (KBs) like DBpedia are very valuable resources, but their usefulness and usability is limited by various quality issues. One such issue is the use of string literals instead of semantically typed entities. In this paper we study the automated canonicalization of such literals, i.e., replacing the literal with an existing entity from the KB or with a new entity that is typed using classes from the KB. We propose a framework that combines both reasoning and machine learning in order to predict the relevant entities and types, and we evaluate this framework against state-of-the-art baselines for both semantic typing and entity matching.