On the Role of Morphological Information for Contextual Lemmatization
Toporkov, Olia, Agerri, Rodrigo
arXiv.org Artificial Intelligence
Lemmatization is a natural language processing (NLP) task which consists of producing, from a given inflected word, its canonical form or lemma. Lemmatization is one of the basic tasks that facilitate downstream NLP applications, and is of particular importance for highly inflected languages. Given that the process of obtaining a lemma from an inflected word can be explained by looking at its morphosyntactic category, including fine-grained morphosyntactic information to train contextual lemmatizers has become common practice, without considering whether that is optimal in terms of downstream performance. To address this issue, in this paper we empirically investigate the role of morphological information in developing contextual lemmatizers for six languages spanning a varied spectrum of morphological complexity: Basque, Turkish, Russian, Czech, Spanish and English. Furthermore, and unlike the vast majority of previous work, we also evaluate lemmatizers in out-of-domain settings, which constitute, after all, their most common use. The results of our study are rather surprising. It turns out that providing lemmatizers with fine-grained morphological features during training is not that beneficial, not even for agglutinative languages. In fact, modern contextual word representations seem to implicitly encode enough morphological information to obtain competitive contextual lemmatizers without seeing any explicit morphological signal. Moreover, our experiments suggest that the best out-of-domain lemmatizers are those using simple UPOS tags or those trained without morphology and, finally, that current evaluation practices for lemmatization are not adequate to clearly discriminate between models.
Oct-20-2023
- Country:
- Oceania > Australia
- North America
- United States
- New Mexico > Santa Fe County
- Santa Fe (0.04)
- Minnesota > Hennepin County
- Minneapolis (0.14)
- Michigan > Washtenaw County
- Ann Arbor (0.04)
- Maryland > Prince George's County
- College Park (0.04)
- Louisiana > Orleans Parish
- New Orleans (0.04)
- Canada > British Columbia
- Europe
- Netherlands (0.04)
- Czechia > Prague (0.04)
- Russia > Central Federal District
- Moscow Oblast > Moscow (0.04)
- Latvia > Riga Municipality
- Riga (0.04)
- Iceland > Capital Region
- Reykjavik (0.04)
- Italy > Tuscany
- Florence (0.04)
- Germany
- France > Provence-Alpes-Côte d'Azur
- Bouches-du-Rhône > Marseille (0.04)
- Spain
- Galicia > Madrid (0.04)
- Catalonia (0.04)
- Basque Country (0.04)
- Valencian Community > Valencia Province
- Valencia (0.04)
- Canary Islands > Gran Canaria
- Las Palmas de Gran Canaria (0.04)
- Denmark > Capital Region
- Copenhagen (0.04)
- Portugal > Lisbon
- Lisbon (0.04)
- Middle East > Republic of Türkiye
- Istanbul Province > Istanbul (0.04)
- Belgium > Brussels-Capital Region
- Brussels (0.04)
- Asia
- South Korea (0.04)
- Singapore (0.04)
- China > Hong Kong (0.04)
- Middle East > Republic of Türkiye
- Istanbul Province > Istanbul (0.04)
- Japan
- Kyūshū & Okinawa > Kyūshū
- Miyazaki Prefecture > Miyazaki (0.04)
- Honshū > Kansai
- Osaka Prefecture > Osaka (0.04)
- Africa > Middle East
- Morocco (0.04)
- Genre:
- Research Report
- New Finding (1.00)
- Experimental Study (1.00)
- Technology: