CERES: Distantly Supervised Relation Extraction from the Semi-Structured Web

Lockard, Colin, Dong, Xin Luna, Einolghozati, Arash, Shiralkar, Prashant

Apr-12-2018–arXiv.org Artificial Intelligence

The web contains countless semi-structured websites, which can be a rich source of information for populating knowledge bases. Existing methods for extracting relations from the DOM trees of semi-structured webpages can achieve high precision and recall only when manual annotations for each website are available. Although there have been efforts to learn extractors from automatically generated labels, these methods are not sufficiently robust to succeed in settings with complex schemas and information-rich websites. In this paper, we present a new method for automatic extraction from semi-structured websites based on distant supervision. We automatically generate training labels by aligning an existing knowledge base with a web page and leveraging the unique structural characteristics of semi-structured websites. We then train a classifier on these potentially noisy and incomplete labels to predict new relation instances. In extraction quality, our method is competitive with annotation-based techniques from the literature. A large-scale experiment on over 400,000 pages from dozens of multilingual long-tail websites harvested 1.25 million facts at a precision of 90%.
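The label-generation idea described in the abstract (aligning a knowledge base with a page to produce training labels) can be illustrated with a minimal sketch. This is not the CERES algorithm itself: the toy knowledge base `KB`, the sample `PAGE`, the helper names `text_nodes` and `distant_labels`, and the exact-string matching rule are all assumptions made for illustration. The abstract does not specify how the method identifies a page's topic or filters noisy matches.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy knowledge base of (subject, predicate, object) triples.
KB = [
    ("Do the Right Thing", "directed_by", "Spike Lee"),
    ("Do the Right Thing", "release_year", "1989"),
    ("Malcolm X", "directed_by", "Spike Lee"),
]

# Hypothetical, well-formed page standing in for a semi-structured detail page.
PAGE = """<html><body>
<h1>Do the Right Thing</h1>
<div><span>Director</span><span>Spike Lee</span></div>
<div><span>Year</span><span>1989</span></div>
<div><span>Runtime</span><span>120 min</span></div>
</body></html>"""


def text_nodes(root):
    """Yield (path, text) for every element that has non-empty leading text."""
    def walk(elem, path):
        text = (elem.text or "").strip()
        if text:
            yield path, text
        counts = {}
        for child in elem:
            # Index siblings per tag so each path identifies one element.
            counts[child.tag] = counts.get(child.tag, 0) + 1
            yield from walk(child, f"{path}/{child.tag}[{counts[child.tag]}]")
    yield from walk(root, f"/{root.tag}")


def distant_labels(page, kb):
    """Return (topic, labels), where labels are (path, text, predicate) tuples."""
    nodes = list(text_nodes(ET.fromstring(page)))
    page_texts = {text for _, text in nodes}

    # Assumed heuristic: the topic is the KB subject named on the page
    # whose known objects appear on the page most often.
    best, best_score = None, 0
    for subject in {s for s, _, _ in kb}:
        if subject not in page_texts:
            continue
        score = sum(1 for s, _, o in kb if s == subject and o in page_texts)
        if score > best_score:
            best, best_score = subject, score
    if best is None:
        return None, []

    # Label each text node that exactly matches a known object of the topic.
    facts = {o: p for s, p, o in kb if s == best}
    labels = [(path, text, facts[text]) for path, text in nodes if text in facts]
    return best, labels


topic, labels = distant_labels(PAGE, KB)
```

In this sketch the labeled nodes would serve as positive training examples for a classifier over DOM nodes, and unlabeled nodes such as the runtime field are where a trained model could predict facts missing from the knowledge base. Exact string matching is a deliberate simplification; real pages would produce ambiguous and spurious matches, which is the noise the abstract refers to.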

data mining, knowledge management, machine learning

Genre:
- Research Report (0.64)

Industry:
- Media > Film (1.00)
- Leisure & Entertainment (1.00)

Technology:
- Information Technology
  - Knowledge Management > Knowledge Engineering (1.00)
  - Data Science > Data Mining (1.00)
  - Communications > Web (1.00)
  - Artificial Intelligence
    - Representation & Reasoning > Expert Systems (0.69)
    - Natural Language
      - Text Processing (0.93)
      - Information Extraction (0.88)
    - Machine Learning
      - Statistical Learning (0.68)
      - Inductive Learning (0.67)
