Building Multilingual Corpora for a Complex Named Entity Recognition and Classification Hierarchy using Wikipedia and DBpedia
Alves, Diego, Thakkar, Gaurish, Amaral, Gabriel, Kuculo, Tin, Tadić, Marko
arXiv.org Artificial Intelligence
As NLP grows in popularity, so does the demand for datasets in low-resource languages. Following a previously established framework, we present the UNER dataset, a multilingual, hierarchical parallel corpus annotated for named entities. We describe in detail the procedure developed to create this type of dataset in any language available on Wikipedia with DBpedia information. The three-step procedure extracts entities from Wikipedia articles, links them to DBpedia, and maps the DBpedia sets of classes to the UNER labels. A post-processing procedure then significantly increases the number of identified entities in the final results. The paper concludes with a statistical and qualitative analysis of the resulting dataset.
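The three-step procedure named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the link-parsing regex, the in-memory `TOY_DBPEDIA_TYPES` lookup (standing in for a real DBpedia query), and the `TOY_CLASS_TO_LABEL` table with its label names are all hypothetical placeholders chosen for illustration. Only the `http://dbpedia.org/resource/<Title>` URI pattern reflects DBpedia's actual naming convention.

```python
import re

# Matches [[Target]], [[Target|anchor]] and [[Target#Section|anchor]] wiki links.
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|([^\[\]]+))?\]\]")

# Hypothetical stand-in for a DBpedia lookup: resource URI -> ontology classes,
# assumed to be ordered from most to least specific.
TOY_DBPEDIA_TYPES = {
    "http://dbpedia.org/resource/Zagreb": ["dbo:City", "dbo:Place"],
    "http://dbpedia.org/resource/Marie_Curie": ["dbo:Scientist", "dbo:Person"],
}

# Hypothetical class-to-label table; NOT the mapping used in the paper.
TOY_CLASS_TO_LABEL = {
    "dbo:City": "Location-City",
    "dbo:Place": "Location",
    "dbo:Scientist": "Name-Person",
    "dbo:Person": "Name-Person",
}


def extract_links(wikitext):
    """Step 1: return (surface text, target article title) pairs."""
    pairs = []
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip()
        surface = (match.group(2) or target).strip()
        pairs.append((surface, target))
    return pairs


def to_dbpedia_uri(title):
    """Step 2: build a DBpedia resource URI (simplified, no percent-encoding)."""
    return "http://dbpedia.org/resource/" + title.replace(" ", "_")


def map_label(classes):
    """Step 3: return the label of the first class that has a mapping."""
    for cls in classes:
        if cls in TOY_CLASS_TO_LABEL:
            return TOY_CLASS_TO_LABEL[cls]
    return None


def annotate(wikitext):
    """Run the three steps; unlinked or unmapped entities get label None."""
    results = []
    for surface, target in extract_links(wikitext):
        uri = to_dbpedia_uri(target)
        label = map_label(TOY_DBPEDIA_TYPES.get(uri, []))
        results.append((surface, uri, label))
    return results


if __name__ == "__main__":
    for row in annotate("[[Marie Curie|Curie]] visited [[Zagreb]] and [[Atlantis]]."):
        print(row)
```

In this sketch, links whose target has no entry in the toy lookup come back with a label of `None`; entities missed at this stage are presumably what the paper's post-processing procedure targets, but the abstract does not describe it, so it is not modeled here.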
Dec-14-2022
- Country:
- North America > United States
- Virginia > Fairfax County
- Fairfax (0.04)
- California > Los Angeles County
- Los Angeles (0.04)
- Europe
- United Kingdom > England
- Greater London > London (0.04)
- Germany > Lower Saxony
- Hanover (0.04)
- Croatia > Zagreb County
- Zagreb (0.04)
- Asia
- South Korea (0.04)
- Indonesia > Sumatra
- Genre:
- Workflow (0.51)
- Research Report (0.50)
- Technology: