MariNER: A Dataset for Historical Brazilian Portuguese Named Entity Recognition
Sarcinelli, João Lucas Luz Lima, Teixeira, Marina Lages Gonçalves, de Paiva, Jade Bortot, Silva, Diego Furtado
arXiv.org Artificial Intelligence
Named Entity Recognition (NER) is a fundamental Natural Language Processing (NLP) task that aims to identify entity mentions in text and classify them into categories. While languages such as English have many high-quality resources for this task, Brazilian Portuguese still has few gold-standard NER datasets, especially for specific domains. In particular, this paper considers the importance of NER for analyzing historical texts in the context of digital humanities. To address this gap, this work outlines the construction of MariNER: Mapeamento e Anotações de Registros hIstóricos para NER (Mapping and Annotation of Historical Records for NER), the first gold-standard dataset for early 20th-century Brazilian Portuguese, with more than 9,000 manually annotated sentences. We also assess and compare the performance of state-of-the-art NER models on the dataset.
Jul-1-2025
- Country:
- Europe > Italy > Liguria > Genoa (0.04)
- Europe > Spain > Galicia > A Coruña Province > Santiago de Compostela (0.04)
- Europe > Switzerland (0.04)
- North America > United States > New Mexico > Bernalillo County > Albuquerque (0.04)
- South America
- Genre:
- Research Report (0.64)
- Technology: