Building Multilingual Corpora for a Complex Named Entity Recognition and Classification Hierarchy using Wikipedia and DBpedia

Alves, Diego, Thakkar, Gaurish, Amaral, Gabriel, Kuculo, Tin, Tadić, Marko

Dec-14-2022–arXiv.org Artificial Intelligence

As the field of NLP grows in popularity, so does the demand for datasets in low-resourced languages. Following a previously established framework, we present the UNER dataset, a multilingual and hierarchical parallel corpus annotated for named entities. We describe in detail the procedure developed to create this type of dataset in any language available on Wikipedia with DBpedia information. The three-step procedure extracts entities from Wikipedia articles, links them to DBpedia, and maps the DBpedia sets of classes to the UNER labels. A post-processing step then significantly increases the number of identified entities in the final results. The paper concludes with a statistical and qualitative analysis of the resulting dataset.
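
The three-step procedure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes entities are taken from wiki-style links in the article markup, it replaces the DBpedia query with a small in-memory dictionary, and the class-to-label table and label names are hypothetical placeholders rather than the actual UNER hierarchy.

```python
import re

# Toy stand-in for a DBpedia lookup: page title -> set of DBpedia classes.
# These entries are illustrative only, not real query results.
TOY_DBPEDIA_CLASSES = {
    "Zagreb": {"dbo:Place", "dbo:City"},
    "Marie Curie": {"dbo:Person", "dbo:Scientist"},
}

# Hypothetical class-to-label table, ordered from most to least specific.
# The label names are placeholders, not the actual UNER labels.
TOY_CLASS_TO_LABEL = [
    ("dbo:City", "Location-City"),
    ("dbo:Place", "Location"),
    ("dbo:Person", "Person"),
]

# Matches [[Target]] and [[Target|surface form]] wiki links.
WIKILINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")


def extract_links(wikitext):
    """Step 1: return (surface form, target title) pairs from wiki markup."""
    return [(m.group(2) or m.group(1), m.group(1))
            for m in WIKILINK.finditer(wikitext)]


def link_to_classes(title):
    """Step 2: look up the classes of a page title (empty set if unknown)."""
    return TOY_DBPEDIA_CLASSES.get(title, set())


def map_to_label(classes):
    """Step 3: return the first matching label, or None if nothing maps."""
    for dbo_class, label in TOY_CLASS_TO_LABEL:
        if dbo_class in classes:
            return label
    return None


def annotate(wikitext):
    """Run the three steps and keep only entities that received a label."""
    annotated = []
    for surface, title in extract_links(wikitext):
        label = map_to_label(link_to_classes(title))
        if label is not None:
            annotated.append((surface, label))
    return annotated
```

Ordering the table from most to least specific class is one simple way to resolve an entity that carries several DBpedia classes; links whose target has no known class are dropped rather than guessed.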

corpus, machine learning, natural language

Country:
- North America > United States
  - Virginia > Fairfax County
    - Fairfax (0.04)
  - California > Los Angeles County
    - Los Angeles (0.04)
- Europe
  - United Kingdom > England
    - Greater London > London (0.04)
  - Germany > Lower Saxony
    - Hanover (0.04)
  - Croatia > Zagreb County
    - Zagreb (0.04)
- Asia
  - South Korea (0.04)
  - Indonesia > Sumatra
    - Bengkulu > Bengkulu (0.04)

Genre:
- Workflow (0.51)
- Research Report (0.50)

Technology:
- Information Technology
  - Communications (1.00)
  - Artificial Intelligence
    - Natural Language > Text Processing (1.00)
    - Machine Learning > Performance Analysis
      - Accuracy (0.47)
