CERES: Distantly Supervised Relation Extraction from the Semi-Structured Web
Colin Lockard, Xin Luna Dong, Arash Einolghozati, Prashant Shiralkar
arXiv.org Artificial Intelligence
The web contains countless semi-structured websites, which can be a rich source of information for populating knowledge bases. Existing methods for extracting relations from the DOM trees of semi-structured webpages can achieve high precision and recall only when manual annotations for each website are available. Although there have been efforts to learn extractors from automatically-generated labels, these methods are not sufficiently robust to succeed in settings with complex schemas and information-rich websites. In this paper we present a new method for automatic extraction from semi-structured websites based on distant supervision. We automatically generate training labels by aligning an existing knowledge base with a web page and leveraging the unique structural characteristics of semi-structured websites. We then train a classifier based on the potentially noisy and incomplete labels to predict new relation instances. Our method can compete with annotation-based techniques in the literature in terms of extraction quality. A large-scale experiment on over 400,000 pages from dozens of multi-lingual long-tail websites harvested 1.25 million facts at a precision of 90%.
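The label-generation idea in the abstract (aligning an existing knowledge base with a web page) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `TextNodeCollector` and `distant_labels`, the toy page, the toy triples, and the exact-string matching rule are all assumptions made for illustration. It only shows how DOM text nodes might be labeled automatically when their text matches an object value of a known fact about the page's topic entity.

```python
from html.parser import HTMLParser


class TextNodeCollector(HTMLParser):
    """Collect (tag_path, text) pairs for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.nodes = []   # (path, text) for each text node seen

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append(("/".join(self.stack), text))


def distant_labels(html, kb_triples, topic):
    """Label text nodes whose text equals an object of a KB fact about `topic`.

    Hypothetical helper: exact, case-insensitive matching is an assumption
    made for this sketch, so the labels it produces can be noisy or incomplete.
    """
    parser = TextNodeCollector()
    parser.feed(html)
    # Map each object value of the topic entity to its predicate.
    objects = {obj.lower(): pred for subj, pred, obj in kb_triples if subj == topic}
    return [
        (path, text, objects[text.lower()])
        for path, text in parser.nodes
        if text.lower() in objects
    ]


# Toy page and toy knowledge base (illustrative data only).
PAGE = (
    "<html><body><h1>Do the Right Thing</h1>"
    "<div><span>Director</span><span>Spike Lee</span></div>"
    "<div><span>Year</span><span>1989</span></div>"
    "</body></html>"
)
KB = [
    ("Do the Right Thing", "directed_by", "Spike Lee"),
    ("Do the Right Thing", "release_year", "1989"),
    ("Malcolm X", "directed_by", "Spike Lee"),
]

labels = distant_labels(PAGE, KB, "Do the Right Thing")
for path, text, predicate in labels:
    print(path, text, predicate)
```

Labels produced this way would then serve as training data for a classifier over DOM nodes. Exact matching alone cannot tell apart two nodes that carry the same string for different reasons, which is one way such labels end up noisy.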
Apr-12-2018
- Genre:
  - Research Report (0.64)
- Industry:
  - Leisure & Entertainment (1.00)
  - Media > Film (1.00)
- Technology:
  - Information Technology
    - Artificial Intelligence
      - Machine Learning > Statistical Learning (0.68)
      - Natural Language
        - Information Extraction (0.88)
        - Text Processing (0.93)
      - Representation & Reasoning > Expert Systems (0.69)
    - Communications > Web (1.00)
    - Data Science > Data Mining (1.00)
    - Knowledge Management > Knowledge Engineering (1.00)