WikiGoldSK: Annotated Dataset, Baselines and Few-Shot Learning Experiments for Slovak Named Entity Recognition

Dávid Šuba, Marek Šuppa, Jozef Kubík, Endre Hamerlik, Martin Takáč

arXiv.org Artificial Intelligence 

Named Entity Recognition (NER) is a lower-level Natural Language Processing (NLP) task in which the aim is to both identify and classify named entity expressions in text into a pre-defined set of semantic types, such as Location, Organization or Person (Goyal et al., 2018). It is a key component of many downstream NLP tasks, ranging from information extraction, machine translation, question answering to entity linking and co-reference resolution, among others. Since its introduction at MUC-6 (Grishman and Sundheim, 1996), the task has been studied extensively, usually as a form of token classification. In recent years, the advent of pre-trained language models (PLMs) combined ...

In this paper we focus on Slovak, a language of the Indo-European family, spoken by 5 million native speakers, which is still missing a manually annotated NER dataset of substantial size. To fill this gap, we propose the following contributions:

We introduce a novel, manually annotated NER dataset called WikiGoldSK built by annotating articles sampled from Slovak Wikipedia and labeled with four entity classes.

We evaluate a selection of multilingual NER baseline models on the presented dataset to compare its quality with that of existing silver-standard Slovak NER datasets.
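
To make the token-classification framing of NER concrete, the following minimal Python sketch assigns one tag per token from a list of entity spans. It assumes the common BIO tagging scheme, which the text above does not specify, and uses the Location and Person types named above as labels; the function name spans_to_bio and the example sentence are illustrative only and are not taken from the WikiGoldSK dataset.

```python
from typing import List, Tuple

def spans_to_bio(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Turn entity spans (start, end_exclusive, label) into one BIO tag per token.

    Assumption: BIO scheme (B- opens an entity, I- continues it, O is outside).
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the entity
    return tags

# Illustrative sentence, already split into tokens (hypothetical example).
tokens = ["Milan", "Rastislav", "Štefánik", "was", "born", "in", "Košariská", "."]
spans = [(0, 3, "Person"), (6, 7, "Location")]

tags = spans_to_bio(tokens, spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Under this framing, a NER model is a classifier that predicts such a tag sequence directly from the tokens, and the entity spans are recovered by grouping each B- tag with the I- tags that follow it.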
