ParaNames: A Massively Multilingual Entity Name Corpus
Jonne Sälevä, Constantine Lignos
arXiv.org Artificial Intelligence
We introduce ParaNames, a multilingual parallel name resource consisting of 118 million names spanning 400 languages. Names are provided for 13.6 million entities, which are mapped to standardized entity types (PER/LOC/ORG). Using Wikidata as a source, we create the largest resource of this type to date. We describe our approach to filtering and standardizing the data to provide the best quality possible. ParaNames is useful for multilingual language processing, both in defining tasks for name translation and transliteration and as supplementary data for tasks such as named entity recognition and linking. We demonstrate an application of ParaNames by training a multilingual model for canonical name translation to and from English. Our resource is released under a Creative Commons license (CC BY 4.0) at https://github.com/bltlab/paranames.
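As a rough illustration of how a parallel name resource can define a name translation task, the sketch below groups names by entity and pairs each English name with its counterparts in other languages. The row layout (Wikidata ID, language code, name, entity type), the sample rows, and the helper `build_pairs` are assumptions made for this example and are not taken from the ParaNames release; consult the repository for the actual file format.

```python
from collections import defaultdict

# Hypothetical rows: (wikidata_id, language_code, name, entity_type).
# This layout is an assumption for illustration, not the released format.
ROWS = [
    ("Q90", "en", "Paris", "LOC"),
    ("Q90", "ru", "Париж", "LOC"),
    ("Q90", "ja", "パリ", "LOC"),
    ("Q64", "en", "Berlin", "LOC"),
    ("Q64", "ru", "Берлин", "LOC"),
]


def build_pairs(rows, source_lang="en"):
    """Return (source_name, target_lang, target_name) triples.

    Entities that have no name in source_lang are skipped.
    """
    # Collect every name of an entity under its ID, keyed by language.
    by_entity = defaultdict(dict)
    for entity_id, lang, name, _entity_type in rows:
        by_entity[entity_id][lang] = name

    pairs = []
    for names in by_entity.values():
        if source_lang not in names:
            continue
        for lang, name in names.items():
            if lang != source_lang:
                pairs.append((names[source_lang], lang, name))
    return pairs


PAIRS = build_pairs(ROWS)
```

Changing `source_lang` reverses the direction, so the same grouping covers translation both to and from English.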
Jul-12-2022