Semantic Networks
DRUM: End-To-End Differentiable Rule Mining On Knowledge Graphs
In this paper, we study the problem of learning probabilistic logical rules for inductive and interpretable link prediction. Despite the importance of inductive link prediction, most previous works focused on transductive link prediction and cannot manage previously unseen entities. Moreover, they are black-box models that are not easily explainable for humans. We propose DRUM, a scalable and differentiable approach for mining first-order logical rules from knowledge graphs that resolves these problems. We motivate our method by making a connection between learning confidence scores for each rule and low-rank tensor approximation. DRUM uses bidirectional RNNs to share useful information across the tasks of learning rules for different relations.
Diversified and Adaptive Negative Sampling on Knowledge Graphs
Liu, Ran, Liu, Zhongzhou, Li, Xiaoli, Wu, Hao, Fang, Yuan
In knowledge graph embedding, aside from positive triplets (ie: facts in the knowledge graph), the negative triplets used for training also have a direct influence on the model performance. In reality, since knowledge graphs are sparse and incomplete, negative triplets often lack explicit labels, and thus they are often obtained from various sampling strategies (eg: randomly replacing an entity in a positive triplet). An ideal sampled negative triplet should be informative enough to help the model train better. However, existing methods often ignore diversity and adaptiveness in their sampling process, which harms the informativeness of negative triplets. As such, we propose a generative adversarial approach called Diversified and Adaptive Negative Sampling DANS on knowledge graphs. DANS is equipped with a two-way generator that generates more diverse negative triplets through two pathways, and an adaptive mechanism that produces more fine-grained examples by localizing the global generator for different entities and relations. On the one hand, the two-way generator increase the overall informativeness with more diverse negative examples; on the other hand, the adaptive mechanism increases the individual sample-wise informativeness with more fine-grained sampling. Finally, we evaluate the performance of DANS on three benchmark knowledge graphs to demonstrate its effectiveness through quantitative and qualitative experiments.
Representation-Enhanced Neural Knowledge Integration with Application to Large-Scale Medical Ontology Learning
Liu, Suqi, Cai, Tianxi, Li, Xiaoou
A large-scale knowledge graph enhances reproducibility in biomedical data discovery by providing a standardized, integrated framework that ensures consistent interpretation across diverse datasets. It improves generalizability by connecting data from various sources, enabling broader applicability of findings across different populations and conditions. Generating reliable knowledge graph, leveraging multi-source information from existing literature, however, is challenging especially with a large number of node sizes and heterogeneous relations. In this paper, we propose a general theoretically guaranteed statistical framework, called RENKI, to enable simultaneous learning of multiple relation types. RENKI generalizes various network models widely used in statistics and computer science. The proposed framework incorporates representation learning output into initial entity embedding of a neural network that approximates the score function for the knowledge graph and continuously trains the model to fit observed facts. We prove nonasymptotic bounds for in-sample and out-of-sample weighted MSEs in relation to the pseudo-dimension of the knowledge graph function class. Additionally, we provide pseudo-dimensions for score functions based on multilayer neural networks with ReLU activation function, in the scenarios when the embedding parameters either fixed or trainable. Finally, we complement our theoretical results with numerical studies and apply the method to learn a comprehensive medical knowledge graph combining a pretrained language model representation with knowledge graph links observed in several medical ontologies. The experiments justify our theoretical findings and demonstrate the effect of weighting in the presence of heterogeneous relations and the benefit of incorporating representation learning in nonparametric models.
Inference over Unseen Entities, Relations and Literals on Knowledge Graphs
Demir, Caglar, Kouagou, N'Dah Jean, Sharma, Arnab, Ngomo, Axel-Cyrille Ngonga
In recent years, knowledge graph embedding models have been successfully applied in the transductive setting to tackle various challenging tasks including link prediction, and query answering. Yet, the transductive setting does not allow for reasoning over unseen entities, relations, let alone numerical or non-numerical literals. Although increasing efforts are put into exploring inductive scenarios, inference over unseen entities, relations, and literals has yet to come. This limitation prohibits the existing methods from handling real-world dynamic knowledge graphs involving heterogeneous information about the world. Here, we propose a remedy to this limitation. We propose the attentive byte-pair encoding layer (BytE) to construct a triple embedding from a sequence of byte-pair encoded subword units of entities and relations. Compared to the conventional setting, BytE leads to massive feature reuse via weight tying, since it forces a knowledge graph embedding model to learn embeddings for subword units instead of entities and relations directly. Consequently, the size of the embedding matrices are not anymore bound to the unique number of entities and relations of a knowledge graph. Experimental results show that BytE improves the link prediction performance of 4 knowledge graph embedding models on datasets where the syntactic representations of triples are semantically meaningful. However, benefits of training a knowledge graph embedding model with BytE dissipate on knowledge graphs where entities and relations are represented with plain numbers or URIs. We provide an open source implementation of BytE to foster reproducible research.
Reviews: Hash Embeddings for Efficient Word Representations
This work describes an extension of the standard word hashing trick for embedding representation by using a weighted combination of several vectors indexed by different hash functions to represent each word. This can be done using a predefined dictionary or during online training. The approach has the benefit of being easy to understand and implement and greatly reduces the number of embedding parameters. The results are generally good and the approach is quite elegant. Additionally, the hash embeddings seem to act as an effective regularizer.
Reviews: Embedding Logical Queries on Knowledge Graphs
This work proposes a method for answering complex conjunctive queries against an incomplete Knowledge Base. The method relies on node embeddings, and trainable differentiable representations for intersection and neighbourhood expansion operations, which can compute a prediction in linear time relative to the edges involved. The method is evaluated on a biomedical and a reddit dataset. The originality of this work lies in posing "soft" logical queries against a "hard" graph, rather than posing "hard" queries against a "soft" graph, as is done typically. The manuscript is clear, well-written and introduces notation using examples.
Reviews: FRAGE: Frequency-Agnostic Word Representation
The core idea is adding an adversarial loss which is recently widely used in many NLP papers as mentioned in this paper submission. The authors define two frequency-based domains: major words and rare words. The proposed approach is easy to use and seems helpful in improving accuracy of several NLP tasks.
Reviews: SimplE Embedding for Link Prediction in Knowledge Graphs
The contribution is a tensor factorization method that represents each object and relation as some vector. The main novelty relative to previous approaches is that each relation r is represented by two embedding vectors: one for r, and one for r -1. The motivation is that should allow objects that appear as both heads and tails to more easily learn jointly from both roles.
Liver Cancer Knowledge Graph Construction based on dynamic entity replacement and masking strategies RoBERTa-BiLSTM-CRF model
Zhang, YiChi, Wang, HaiLing, Gao, YongBin, Hu, XiaoJun, Fan, YingFang, Fang, ZhiJun
Background: Liver cancer ranks as the fifth most common malignant tumor and the second most fatal in our country. Early diagnosis is crucial, necessitating that physicians identify liver cancer in patients at the earliest possible stage. However, the diagnostic process is complex and demanding. Physicians must analyze a broad spectrum of patient data, encompassing physical condition, symptoms, medical history, and results from various examinations and tests, recorded in both structured and unstructured medical formats. This results in a significant workload for healthcare professionals. In response, integrating knowledge graph technology to develop a liver cancer knowledge graph-assisted diagnosis and treatment system aligns with national efforts toward smart healthcare. Such a system promises to mitigate the challenges faced by physicians in diagnosing and treating liver cancer. Methods: This paper addresses the major challenges in building a knowledge graph for hepatocellular carcinoma diagnosis, such as the discrepancy between public data sources and real electronic medical records, the effective integration of which remains a key issue. The knowledge graph construction process consists of six steps: conceptual layer design, data preprocessing, entity identification, entity normalization, knowledge fusion, and graph visualization. A novel Dynamic Entity Replacement and Masking Strategy (DERM) for named entity recognition is proposed. Results: A knowledge graph for liver cancer was established, including 7 entity types such as disease, symptom, and constitution, containing 1495 entities. The recognition accuracy of the model was 93.23%, the recall was 94.69%, and the F1 score was 93.96%.
A large collection of bioinformatics question-query pairs over federated knowledge graphs: methodology and applications
Bolleman, Jerven, Emonet, Vincent, Altenhoff, Adrian, Bairoch, Amos, Blatter, Marie-Claude, Bridge, Alan, Duvaud, Severine, Gasteiger, Elisabeth, Kuznetsov, Dmitry, Moretti, Sebastien, Michel, Pierre-Andre, Morgat, Anne, Pagni, Marco, Redaschi, Nicole, Zahn-Zabal, Monique, de Farias, Tarcisio Mendes, Sima, Ana Claudia
Background. In the last decades, several life science resources have structured data using the same framework and made these accessible using the same query language to facilitate interoperability. Knowledge graphs have seen increased adoption in bioinformatics due to their advantages for representing data in a generic graph format. For example, yummydata.org catalogs more than 60 knowledge graphs accessible through SPARQL, a technical query language. Although SPARQL allows powerful, expressive queries, even across physically distributed knowledge graphs, formulating such queries is a challenge for most users. Therefore, to guide users in retrieving the relevant data, many of these resources provide representative examples. These examples can also be an important source of information for machine learning, if a sufficiently large number of examples are provided and published in a common, machine-readable and standardized format across different resources. Findings. We introduce a large collection of human-written natural language questions and their corresponding SPARQL queries over federated bioinformatics knowledge graphs (KGs) collected for several years across different research groups at the SIB Swiss Institute of Bioinformatics. The collection comprises more than 1000 example questions and queries, including 65 federated queries. We propose a methodology to uniformly represent the examples with minimal metadata, based on existing standards. Furthermore, we introduce an extensive set of open-source applications, including query graph visualizations and smart query editors, easily reusable by KG maintainers who adopt the proposed methodology. Conclusions. We encourage the community to adopt and extend the proposed methodology, towards richer KG metadata and improved Semantic Web services.