WikiGoldSK: Annotated Dataset, Baselines and Few-Shot Learning Experiments for Slovak Named Entity Recognition

Šuba, Dávid, Šuppa, Marek, Kubík, Jozef, Hamerlik, Endre, Takáč, Martin

Apr-8-2023–arXiv.org Artificial Intelligence

Named Entity Recognition (NER) is a lower-level Natural Language Processing (NLP) task in which the aim is to both identify and classify named entity expressions in text into a pre-defined set of semantic types, such as Location, Organization or Person (Goyal et al., 2018). It is a key component of many downstream NLP tasks, ranging from information extraction, machine translation, question answering to entity linking and co-reference resolution, among others. Since its introduction at MUC-6 (Grishman and Sundheim, 1996), the task has been studied extensively, usually as a form of token classification. In recent years, the advent of pre-trained language models (PLMs) combined ...

In this paper we focus on Slovak, a language of the Indo-European family, spoken by 5 million native speakers, which is still missing a manually annotated NER dataset of substantial size. To fill this gap, we propose the following contributions:

- We introduce a novel, manually annotated NER dataset called WikiGoldSK built by annotating articles sampled from Slovak Wikipedia and labeled with four entity classes.
- We evaluate a selection of multilingual NER baseline models on the presented dataset to compare its quality with that of existing silver-standard Slovak NER datasets.
