Lemmatization as a Classification Task: Results from Arabic across Multiple Genres
arXiv.org Artificial Intelligence
Lemmatization is crucial for NLP tasks in morphologically rich languages with ambiguous orthography, such as Arabic, but existing tools face challenges due to inconsistent standards and limited genre coverage. This paper introduces two novel approaches that frame lemmatization as classification into a Lemma-POS-Gloss (LPG) tagset, leveraging machine translation and semantic clustering. We also present a new Arabic lemmatization test set covering diverse genres, standardized alongside existing datasets. We evaluate character-level sequence-to-sequence models, which perform competitively and offer complementary value, but are limited to lemma prediction (not LPG) and prone to hallucinating implausible forms. Our results show that classification and clustering yield more robust, interpretable outputs, setting new benchmarks for Arabic lemmatization.
Jun-24-2025
- Country:
- Africa > Middle East
- Egypt > Cairo Governorate > Cairo (0.04)
- Asia
- Japan > Kyūshū & Okinawa
- Kyūshū > Miyazaki Prefecture > Miyazaki (0.04)
- Middle East > UAE
- Abu Dhabi Emirate > Abu Dhabi (0.04)
- South Korea (0.04)
- Thailand > Bangkok
- Bangkok (0.04)
- Europe
- Belgium > Brussels-Capital Region
- Brussels (0.04)
- France > Provence-Alpes-Côte d'Azur
- Bouches-du-Rhône > Marseille (0.04)
- Middle East > Malta (0.04)
- Sweden > Vaestra Goetaland
- Gothenburg (0.04)
- North America > United States
- California > Los Angeles County
- Los Angeles (0.14)
- Ohio > Franklin County
- Columbus (0.04)
- Genre:
- Research Report > New Finding (1.00)
- Technology: