ParaNames 1.0: Creating an Entity Name Corpus for 400+ Languages using Wikidata

May-15-2024–arXiv.org Artificial Intelligence

We introduce ParaNames, a massively multilingual parallel name resource consisting of 140 million names spanning over 400 languages. Names are provided for 16.8 million entities, and each entity is mapped from a complex type hierarchy to a standard type (PER/LOC/ORG). Using Wikidata as a source, we create the largest resource of this type to date. We describe our approach to filtering and standardizing the data to provide the best quality possible. ParaNames is useful for multilingual language processing, both in defining tasks for name translation/transliteration and as supplementary data for tasks such as named entity recognition and linking. We demonstrate the usefulness of ParaNames on two tasks. First, we perform canonical name translation between English and 17 other languages. Second, we use it as a gazetteer for multilingual named entity recognition, obtaining performance improvements on all 10 languages evaluated.
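As an illustration of the gazetteer use mentioned in the abstract, the following is a minimal sketch of longest-match lookup against a parallel name table. It is not the authors' method: the row layout `(entity_id, language, name, type)`, the placeholder IDs `E1` to `E3`, and the helpers `build_gazetteer` and `tag` are assumptions made for this example and do not reflect the actual ParaNames release format.

```python
from collections import defaultdict

# Hypothetical toy rows: (entity_id, language, name, type).
# The IDs are placeholders, not real Wikidata identifiers.
ROWS = [
    ("E1", "en", "New York", "LOC"),
    ("E1", "es", "Nueva York", "LOC"),
    ("E2", "en", "United Nations", "ORG"),
    ("E2", "es", "Naciones Unidas", "ORG"),
    ("E3", "en", "Marie Curie", "PER"),
]


def build_gazetteer(rows):
    """Index names per language as lowercase token tuples mapped to a type."""
    gaz = defaultdict(dict)
    for _entity_id, lang, name, etype in rows:
        gaz[lang][tuple(name.lower().split())] = etype
    return gaz


def tag(tokens, lang, gaz):
    """Label tokens with BIO tags using greedy longest-match lookup."""
    entries = gaz.get(lang, {})
    max_len = max((len(key) for key in entries), default=0)
    lowered = [t.lower() for t in tokens]
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate span first, then shrink.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            etype = entries.get(tuple(lowered[i:i + n]))
            if etype:
                labels[i] = "B-" + etype
                for j in range(i + 1, i + n):
                    labels[j] = "I-" + etype
                matched = n
                break
        i += matched or 1  # skip past a match, else advance one token
    return labels


gaz = build_gazetteer(ROWS)
print(tag("Marie Curie visited New York".split(), "en", gaz))
print(tag("Naciones Unidas en Nueva York".split(), "es", gaz))
```

In practice, matches like these would more likely feed a trained tagger as additional features than be used as final labels, since exact string lookup cannot resolve ambiguous names.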

experiment, paraname, wikidata

Country:
- South America > Brazil
  - Bahia (0.04)
- Oceania > Australia
  - Victoria > Melbourne (0.04)
- North America
  - Canada (0.04)
  - United States
    - New York (0.04)
    - North Carolina (0.04)
    - California (0.04)
    - New Mexico > Santa Fe County
      - Santa Fe (0.04)
    - Minnesota > Hennepin County
      - Minneapolis (0.14)
    - Colorado > Denver County
      - Denver (0.04)
- Europe
  - United Kingdom (0.04)
  - Spain (0.04)
  - Slovenia (0.04)
  - Germany > Berlin (0.04)
  - Bulgaria (0.04)
  - France > Provence-Alpes-Côte d'Azur
    - Bouches-du-Rhône > Marseille (0.04)
- Asia
  - China (0.04)
  - Kazakhstan (0.04)
  - Middle East
    - Republic of Türkiye (0.04)
    - Israel (0.04)
  - Japan > Kyūshū & Okinawa
    - Kyūshū > Miyazaki Prefecture > Miyazaki (0.04)

Genre:
- Research Report
  - New Finding (0.47)
  - Experimental Study (0.46)

Industry:
- Government > Regional Government > North America Government > United States Government (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Text Processing (1.00)
  - Machine Learning > Neural Networks (0.69)
