Comparison of Current Approaches to Lemmatization: A Case Study in Estonian
Dorkin, Aleksei, Sirts, Kairit
arXiv.org Artificial Intelligence
This study evaluates three lemmatization approaches for Estonian: Generative character-level models, Pattern-based word-level classification models, and rule-based morphological analysis. According to our experiments, a significantly smaller Generative model consistently outperforms the Pattern-based classification model based on EstBERT. Additionally, we observe a relatively small overlap in the errors made by the three models, indicating that an ensemble of the different approaches could lead to improvements.
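The error-overlap observation can be illustrated with a minimal sketch. It is not the authors' evaluation code: the function names, the model labels, and the toy predictions below are all hypothetical, chosen only to show how shared and non-shared errors might be counted when each model's output is aligned token by token with the gold lemmas.

```python
def error_set(gold, predicted):
    """Return the indices of tokens whose predicted lemma differs from the gold lemma."""
    return {i for i, (g, p) in enumerate(zip(gold, predicted)) if g != p}


def error_overlap(gold, predictions):
    """Given a dict of model name -> predicted lemmas, return two index sets:
    tokens that every model gets wrong, and tokens that at least one model gets wrong."""
    sets = [error_set(gold, p) for p in predictions.values()]
    shared = set.intersection(*sets)
    any_wrong = set.union(*sets)
    return shared, any_wrong


# Toy data, invented for illustration only (not taken from the paper).
gold = ["maja", "minema", "laps", "hea", "tulema"]
predictions = {
    "generative": ["maja", "minema", "lapse", "hea", "tulema"],
    "pattern": ["maja", "mine", "lapse", "hea", "tulema"],
    "rule_based": ["maja", "minema", "lapse", "hea", "tule"],
}

shared, any_wrong = error_overlap(gold, predictions)

# Upper bound for an ideal ensemble: a token is only unrecoverable
# when all models get it wrong.
oracle_accuracy = 1 - len(shared) / len(gold)
print(sorted(shared), sorted(any_wrong), oracle_accuracy)
```

When the shared set is small relative to the union, most errors are made by only some of the models, which is the situation in which combining approaches has room to help.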
Apr-23-2024
- Country:
  - Asia > Japan
    - Kyūshū & Okinawa > Kyūshū > Miyazaki Prefecture > Miyazaki (0.04)
  - Europe
    - Belgium > Brussels-Capital Region
      - Brussels (0.04)
    - Estonia > Tartu County
      - Tartu (0.05)
    - France
      - Provence-Alpes-Côte d'Azur > Bouches-du-Rhône
        - Marseille (0.04)
      - Île-de-France > Paris
        - Paris (0.04)
    - Portugal > Lisbon
      - Lisbon (0.04)
  - North America > United States
    - Louisiana > Orleans Parish > New Orleans (0.04)
- Genre:
  - Research Report (0.50)
- Technology: