Early Steps Toward Web-Scale Information Extraction with LODIE
The exponential growth of the web generates an exceptional quantity of data for which automatic knowledge capture is essential. This work describes the methodology for web-scale information extraction (IE) in the linked open data information-extraction (LODIE) project and highlights results from the early experiments carried out in the initial phase of the project. LODIE aims to develop information-extraction techniques able to scale at web level and adapt to user information needs. The core idea behind LODIE is to use linked open data, a very large-scale information resource that provides invaluable annotated data on a growing number of domains, as a groundbreaking solution for IE. This article has two objectives: first, to describe the LODIE project as a whole and depict its general challenges and directions; and second, to describe some initial steps taken toward the general solution, focusing on a specific IE subtask, wrapper induction.
The current state of the art has mainly addressed tasks for which resources for training are available (for example, the TAP ontology in the paper by Etzioni and colleagues [2004]) or used generic patterns to extract generic facts (for example, Banko et al. [2007]; OpenCalais.com). The limited availability of resources for training has so far prevented the study of the generalized use of large-scale resources to port to specific user information needs. The LODIE project focuses on the study of IE models and algorithms able to perform efficient user-centered web-scale learning by exploiting linked open data (LOD). In this article we highlight the initial steps of the LODIE project, focusing on a specific IE task, wrapper induction (WI), which consists of automatically learning wrappers for uniform web pages, that is, pages from one website, usually generated with the same script and all describing the same type of entity. We show results on the WI task, exploiting linked data obtained from DBpedia as learning material.
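To make the WI task concrete, the following is a minimal sketch and not the LODIE project's actual algorithm. It assumes a wrapper can be approximated by a single tag path (such as html/body/h1) and that the learning material is a small gazetteer of known entity names, here a hypothetical stand-in for labels obtained from DBpedia. The function names induce_wrapper and apply_wrapper, the tag-path representation, and the exact-match voting are all illustrative assumptions.

```python
from collections import Counter
from html.parser import HTMLParser

# Tags that never have a closing tag; they are not pushed on the path stack.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class PathTextParser(HTMLParser):
    """Collect (tag path, text) pairs, e.g. ('html/body/h1', 'Dune')."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerate sloppy markup.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.pairs.append(("/".join(self.stack), text))


def path_texts(html):
    """Parse one page and return its (tag path, text) pairs."""
    parser = PathTextParser()
    parser.feed(html)
    parser.close()
    return parser.pairs


def induce_wrapper(pages, gazetteer):
    """Return the tag path whose text most often matches a gazetteer entry.

    Assumption: the gazetteer is incomplete, so pages whose value is
    unknown simply cast no vote.
    """
    known = {value.lower() for value in gazetteer}
    hits = Counter()
    for html in pages:
        # Count each path at most once per page.
        matched = {p for p, text in path_texts(html) if text.lower() in known}
        hits.update(matched)
    return hits.most_common(1)[0][0] if hits else None


def apply_wrapper(path, html):
    """Extract every text node found at the learned tag path."""
    return [text for p, text in path_texts(html) if p == path]


# Hypothetical pages from one website, all describing the same entity type.
TRAIN_PAGES = [
    "<html><body><h1>Dune</h1><p>Frank Herbert</p></body></html>",
    "<html><body><h1>Emma</h1><p>Jane Austen</p></body></html>",
    "<html><body><h1>Hyperion</h1><p>Dan Simmons</p></body></html>",
]
# Hypothetical gazetteer of book titles; "Hyperion" is deliberately missing.
GAZETTEER = ["Dune", "Emma", "Ulysses"]

WRAPPER = induce_wrapper(TRAIN_PAGES, GAZETTEER)
NEW_PAGE = "<html><body><h1>Neuromancer</h1><p>William Gibson</p></body></html>"
print(WRAPPER, apply_wrapper(WRAPPER, NEW_PAGE))
# prints: html/body/h1 ['Neuromancer']
```

The sketch shows why uniform pages matter: because every page is generated from the same template, a path learned from the few pages whose values are already known can extract values that appear in no gazetteer. A realistic system would need richer path expressions and tolerance to noisy matches.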
Jan-4-2018, 09:46:01 GMT