MultiCoNER: A Large-scale Multilingual dataset for Complex Named Entity Recognition

Malmasi, Shervin, Fang, Anjie, Fetahu, Besnik, Kar, Sudipta, Rokhlenko, Oleg

Aug-30-2022–arXiv.org Artificial Intelligence

We present MultiCoNER, a large multilingual dataset for Named Entity Recognition that covers 3 domains (Wiki sentences, questions, and search queries) across 11 languages, as well as multilingual and code-mixing subsets. This dataset is designed to represent contemporary challenges in NER, including low-context scenarios (short and uncased text), syntactically complex entities like movie titles, and long-tail entity distributions. The 26M token dataset is compiled from public resources using techniques such as heuristic-based sentence sampling, template extraction and slotting, and machine translation. We evaluated two NER models on our dataset: a baseline XLM-RoBERTa model, and a state-of-the-art GEMNET model that leverages gazetteers. The baseline achieves moderate performance (macro-F1=54%), highlighting the difficulty of our data. GEMNET, which uses gazetteers, improves significantly on the baseline (average improvement of macro-F1=+30%). MultiCoNER poses challenges even for large pre-trained language models, and we believe that it can help further research in building robust NER systems. MultiCoNER is publicly available at https://registry.opendata.aws/multiconer/ and we hope that this resource will help advance research in various aspects of NER.
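
The macro-F1 figures above average a per-class F1 score over entity types, so rare types count as much as frequent ones. The following is a minimal sketch of that computation, assuming entity-level exact-match scoring; the abstract does not specify the exact evaluation protocol. The function name `macro_f1`, the tuple layout, and the labels `PER` and `CW` are illustrative assumptions, not taken from the paper's code.

```python
def macro_f1(gold, pred):
    """Macro-averaged F1 over entity types.

    gold, pred: sets of (sentence_id, start, end, entity_type) tuples.
    A predicted entity counts as correct only if its span and type both
    match a gold entity exactly (an assumption of this sketch).
    """
    types = {e[3] for e in gold} | {e[3] for e in pred}
    scores = []
    for t in sorted(types):
        g = {e for e in gold if e[3] == t}
        p = {e for e in pred if e[3] == t}
        tp = len(g & p)  # exact span-and-type matches
        precision = tp / len(p) if p else 0.0
        recall = tp / len(g) if g else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    # Unweighted mean: every entity type contributes equally.
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical toy data: one CW span has a wrong boundary in the prediction.
gold = {(0, 0, 2, "PER"), (0, 3, 5, "CW"), (1, 0, 1, "CW")}
pred = {(0, 0, 2, "PER"), (0, 3, 5, "CW"), (1, 0, 2, "CW")}
print(macro_f1(gold, pred))
```

In this toy case the `PER` class scores an F1 of 1.0 and the `CW` class 0.5, so a single boundary error on one class pulls the macro average down noticeably.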

artificial intelligence, natural language, text processing

Country:
- North America > United States (0.46)

Genre:
- Research Report (1.00)

Industry:
- Leisure & Entertainment (0.66)
- Media (0.48)
- Materials > Chemicals
  - Industrial Gases > Liquified Gas (0.46)
  - Commodity Chemicals > Petrochemicals
    - LNG (0.46)
- Energy > Oil & Gas
  - Midstream (0.46)

Technology:
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
