Equipping machines with comprehensive knowledge of the world's entities and their relationships has been a long-standing goal of AI. Over the last decade, large-scale knowledge bases, also known as knowledge graphs, have been automatically constructed from web contents and text sources, and have become a key asset for search engines. This machine knowledge can be harnessed to semantically interpret textual phrases in news, social media and web tables, and contributes to question answering, natural language processing and data analytics. This article surveys fundamental concepts and practical methods for creating and curating large knowledge bases. It covers models and methods for discovering and canonicalizing entities and their semantic types and organizing them into clean taxonomies. On top of this, the article discusses the automatic extraction of entity-centric properties. To support the long-term life-cycle and the quality assurance of machine knowledge, the article presents methods for constructing open schemas and for knowledge curation. Case studies on academic projects and industrial knowledge graphs complement the survey of concepts and methods.
The web contains countless semi-structured websites, which can be a rich source of information for populating knowledge bases. Existing methods for extracting relations from the DOM trees of semi-structured webpages can achieve high precision and recall only when manual annotations for each website are available. Although there have been efforts to learn extractors from automatically-generated labels, these methods are not sufficiently robust to succeed in settings with complex schemas and information-rich websites. In this paper we present a new method for automatic extraction from semi-structured websites based on distant supervision. We automatically generate training labels by aligning an existing knowledge base with a web page and leveraging the unique structural characteristics of semi-structured websites. We then train a classifier based on the potentially noisy and incomplete labels to predict new relation instances. Our method can compete with annotation-based techniques in the literature in terms of extraction quality. A large-scale experiment on over 400,000 pages from dozens of multi-lingual long-tail websites harvested 1.25 million facts at a precision of 90%.
Embedded markup of Web pages has seen widespread adoption throughout the past years driven by standards such as RDFa and Microdata and initiatives such as schema.org, where recent studies show an adoption by 39% of all Web pages already in 2016. While this constitutes an important information source for tasks such as Web search, Web page classification or knowledge graph augmentation, individual markup nodes are usually sparsely described and often lack essential information. For instance, from 26 million nodes describing events within the Common Crawl in 2016, 59% of nodes provide less than six statements and only 257,000 nodes (0.96%) are typed with more specific event subtypes. Nevertheless, given the scale and diversity of Web markup data, nodes that provide missing information can be obtained from the Web in large quantities, in particular for categorical properties. Such data constitutes potential training data for inferring missing information to significantly augment sparsely described nodes. In this work, we introduce a supervised approach for inferring missing categorical properties in Web markup. Our experiments, conducted on properties of events and movies, show a performance of 79% and 83% F1 score correspondingly, significantly outperforming existing baselines.
Distant supervision relation extraction (DSRE) is an efficient method to extract semantic relations on a large-scale heuristic labeling corpus. However, it usually brings in a massive noisy data. In order to alleviate this problem, many recent approaches adopt reinforcement learning (RL), which aims to select correct data autonomously before relation classification. Although these RL methods outperform conventional multi-instance learning-based methods, there are still two neglected problems: 1) the existing RL methods ignore the feedback of noisy data, 2) the reduction of training corpus exacerbates long-tail problem. In this paper, we propose a novel framework to solve the two problems mentioned above. Firstly, we design a novel reward function to obtain feedback from both correct and noisy data. In addition, we use implicit relations information to improve RL. Secondly, we propose the hierarchical memory extractor (HME), which utilizes the gating mechanism to share the semantics from correlative instances between data-rich and data-poor classes. Moreover, we define a hierarchical weighted ranking loss function to implement top-down search processing. Extensive experiments conducted on the widely used NYT dataset show significant improvement over state-of-the-art baseline methods.
Entity alignment (EA) is to discover equivalent entities in knowledge graphs (KGs), which bridges heterogeneous sources of information and facilitates the integration of knowledge. Existing EA solutions mainly rely on structural information to align entities, typically through KG embedding. Nonetheless, in real-life KGs, only a few entities are densely connected to others, and the rest majority possess rather sparse neighborhood structure. We refer to the latter as long-tail entities, and observe that such phenomenon arguably limits the use of structural information for EA. To mitigate the issue, we revisit and investigate into the conventional EA pipeline in pursuit of elegant performance. For pre-alignment, we propose to amplify long-tail entities, which are of relatively weak structural information, with entity name information that is generally available (but overlooked) in the form of concatenated power mean word embeddings. For alignment, under a novel complementary framework of consolidating structural and name signals, we identify entity's degree as important guidance to effectively fuse two different sources of information. To this end, a degree-aware co-attention network is conceived, which dynamically adjusts the significance of features in a degree-aware manner. For post-alignment, we propose to complement original KGs with facts from their counterparts by using confident EA results as anchors via iterative training. Comprehensive experimental evaluations validate the superiority of our proposed techniques.