MultiCoNER: A Large-scale Multilingual dataset for Complex Named Entity Recognition
Malmasi, Shervin; Fang, Anjie; Fetahu, Besnik; Kar, Sudipta; Rokhlenko, Oleg
arXiv.org Artificial Intelligence
We present MultiCoNER, a large multilingual dataset for Named Entity Recognition that covers 3 domains (Wiki sentences, questions, and search queries) across 11 languages, as well as multilingual and code-mixing subsets. This dataset is designed to represent contemporary challenges in NER, including low-context scenarios (short and uncased text), syntactically complex entities like movie titles, and long-tail entity distributions. The 26M token dataset is compiled from public resources using techniques such as heuristic-based sentence sampling, template extraction and slotting, and machine translation. We applied two NER models to our dataset: a baseline XLM-RoBERTa model, and a state-of-the-art GEMNET model that leverages gazetteers. The baseline achieves moderate performance (macro-F1=54%), highlighting the difficulty of our data. GEMNET, which uses gazetteers, improves on the baseline significantly (average improvement of macro-F1=+30%). MultiCoNER poses challenges even for large pre-trained language models, and we believe that it can help further research in building robust NER systems. MultiCoNER is publicly available at https://registry.opendata.aws/multiconer/ and we hope that this resource will help advance research in various aspects of NER.
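The abstract reports results as macro-F1, which averages per-class F1 scores so that rare entity types count as much as frequent ones. The following is a minimal sketch of an entity-level macro-F1 under exact span matching; the function name `macro_f1`, the `(start, end, label)` span representation, and the example labels are illustrative assumptions, not the evaluation code used by the authors.

```python
from collections import defaultdict


def macro_f1(gold, pred):
    """Entity-level macro-F1 with exact span matching.

    gold, pred: lists (one item per sentence) of sets of
    (start, end, label) tuples. This layout is an assumption
    made for illustration only.
    """
    tp = defaultdict(int)  # correctly predicted spans, per label
    fp = defaultdict(int)  # predicted spans absent from gold, per label
    fn = defaultdict(int)  # gold spans that were missed, per label

    for gold_spans, pred_spans in zip(gold, pred):
        for span in pred_spans:
            if span in gold_spans:
                tp[span[2]] += 1
            else:
                fp[span[2]] += 1
        for span in gold_spans:
            if span not in pred_spans:
                fn[span[2]] += 1

    labels = set(tp) | set(fp) | set(fn)
    if not labels:
        return 0.0

    scores = []
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)

    # Unweighted mean over classes: every label counts equally.
    return sum(scores) / len(scores)


# Hypothetical two-sentence example with placeholder labels.
gold = [{(0, 2, "CW"), (4, 5, "PER")}, {(1, 2, "LOC")}]
pred = [{(0, 2, "CW")}, {(1, 2, "PER")}]
print(macro_f1(gold, pred))
```

In this toy example one class is predicted perfectly and two are missed entirely, so the unweighted average is one third even though half of the predicted spans are correct; this is why macro-F1 is sensitive to performance on long-tail entity types.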
Aug-30-2022
- Country:
- North America > United States (0.46)
- Genre:
- Research Report (1.00)
- Industry:
- Energy > Oil & Gas
- Midstream (0.46)
- Leisure & Entertainment (0.66)
- Materials > Chemicals
- Commodity Chemicals > Petrochemicals
- LNG (0.46)
- Industrial Gases > Liquified Gas (0.46)
- Media (0.48)
- Technology: