lemmatizer
A Case Study of Cross-Lingual Zero-Shot Generalization for Classical Languages in LLMs
Akavarapu, V. S. D. S. Mahesh, Terdalkar, Hrishikesh, Bhattacharyya, Pramit, Agarwal, Shubhangi, Deulgaonkar, Vishakha, Manna, Pralay, Dangarikar, Chaitali, Bhattacharya, Arnab
Large Language Models (LLMs) have demonstrated remarkable generalization capabilities across diverse tasks and languages. In this study, we focus on natural language understanding in three classical languages -- Sanskrit, Ancient Greek and Latin -- to investigate the factors affecting cross-lingual zero-shot generalization. First, we explore named entity recognition and machine translation into English. While LLMs perform equal to or better than fine-tuned baselines on out-of-domain data, smaller models often struggle, especially with niche or abstract entity types. In addition, we concentrate on Sanskrit by presenting a factoid question-answering (QA) dataset and show that incorporating context via retrieval-augmented generation approach significantly boosts performance. In contrast, we observe pronounced performance drops for smaller LLMs across these QA tasks. These results suggest model scale as an important factor influencing cross-lingual generalization. Assuming that models used such as GPT-4o and Llama-3.1 are not instruction fine-tuned on classical languages, our findings provide insights into how LLMs may generalize on these languages and their consequent utility in classical studies.
A Simple Joint Model for Improved Contextual Neural Lemmatization
Malaviya, Chaitanya, Wu, Shijie, Cotterell, Ryan
English verbs have multiple forms. For instance, talk may also appear as talks, talked or talking, depending on the context. The NLP task of lemmatization seeks to map these diverse forms back to a canonical one, known as the lemma. We present a simple joint neural model for lemmatization and morphological tagging that achieves state-of-the-art results on 20 languages from the Universal Dependencies corpora. Our paper describes the model in addition to training and decoding procedures. Error analysis indicates that joint morphological tagging and lemmatization is especially helpful in low-resource lemmatization and languages that display a larger degree of morphological complexity.
Figure 1: Our structured neural model shown as a hybrid (directed-undirected) graphical model (Koller and Friedman, 2009).
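To make the task itself concrete, the following is a minimal sketch of lemmatization by suffix stripping. It is not the joint neural model described in the abstract; the function name `toy_lemmatize`, the suffix list, and the minimum stem length are all assumptions chosen for illustration.

```python
# Toy suffix-stripping lemmatizer (illustration only, NOT the paper's model).
# The suffix list and the minimum stem length are hypothetical choices.
SUFFIXES = ("ing", "ed", "s")
MIN_STEM_LEN = 3


def toy_lemmatize(word: str) -> str:
    """Map an inflected English form to a candidate lemma by stripping one suffix."""
    w = word.lower()
    for suffix in SUFFIXES:
        stem_len = len(w) - len(suffix)
        if w.endswith(suffix) and stem_len >= MIN_STEM_LEN:
            return w[:stem_len]
    return w  # no suffix matched: assume the word is already a lemma


forms = ["talk", "talks", "talked", "talking"]
lemmas = [toy_lemmatize(f) for f in forms]
```

A context-free rule like this breaks on forms such as "running" (it yields "runn"), which is one reason contextual, learned lemmatizers are studied.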
BanLemma: A Word Formation Dependent Rule and Dictionary Based Bangla Lemmatizer
Afrin, Sadia, Chowdhury, Md. Shahad Mahmud, Islam, Md. Ekramul, Khan, Faisal Ahamed, Chowdhury, Labib Imam, Mahtab, MD. Motahar, Chowdhury, Nazifa Nuha, Forkan, Massud, Kundu, Neelima, Arif, Hakim, Rashid, Mohammad Mamun Or, Amin, Mohammad Ruhul, Mohammed, Nabeel
Lemmatization holds significance in both natural language processing (NLP) and linguistics, as it effectively decreases data density and aids in comprehending contextual meaning. However, due to the highly inflected nature and morphological richness, lemmatization in Bangla text poses a complex challenge. In this study, we propose linguistic rules for lemmatization and utilize a dictionary along with the rules to design a lemmatizer specifically for Bangla. Our system aims to lemmatize words based on their parts of speech class within a given sentence. Unlike previous rule-based approaches, we analyzed the suffix marker occurrence according to the morpho-syntactic values and then utilized sequences of suffix markers instead of entire suffixes. To develop our rules, we analyze a large corpus of Bangla text from various domains, sources, and time periods to observe the word formation of inflected words. The lemmatizer achieves an accuracy of 96.36% when tested against a manually annotated test dataset by trained linguists and demonstrates competitive performance on three previously published Bangla lemmatization datasets. We are making the code and datasets publicly available at https://github.com/eblict-gigatech/BanLemma in order to contribute to the further advancement of Bangla NLP.
On the Role of Morphological Information for Contextual Lemmatization
Toporkov, Olia, Agerri, Rodrigo
Lemmatization is a natural language processing (NLP) task which consists of producing, from a given inflected word, its canonical form or lemma. Lemmatization is one of the basic tasks that facilitate downstream NLP applications, and is of particular importance for highly inflected languages. Given that the process to obtain a lemma from an inflected word can be explained by looking at its morphosyntactic category, including fine-grained morphosyntactic information to train contextual lemmatizers has become common practice, without considering whether that is the optimum in terms of downstream performance. In order to address this issue, in this paper we empirically investigate the role of morphological information to develop contextual lemmatizers in six languages within a varied spectrum of morphological complexity: Basque, Turkish, Russian, Czech, Spanish and English. Furthermore, and unlike the vast majority of previous work, we also evaluate lemmatizers in out-of-domain settings, which constitutes, after all, their most common application use. The results of our study are rather surprising. It turns out that providing lemmatizers with fine-grained morphological features during training is not that beneficial, not even for agglutinative languages. In fact, modern contextual word representations seem to implicitly encode enough morphological information to obtain competitive contextual lemmatizers without seeing any explicit morphological signal. Moreover, our experiments suggest that the best lemmatizers out-of-domain are those using simple UPOS tags or those trained without morphology and, finally, that current evaluation practices for lemmatization are not adequate to clearly discriminate between models.
Lexicon and Rule-based Word Lemmatization Approach for the Somali Language
Mohamed, Shafie Abdi, Mohamed, Muhidin Abdullahi
The lemmatization summary statistics of the Example 3 sentence are also provided in Table 1. In this case, the percentage of words that were normalized for the example reached 100%, which means that all content words (excluding stop words and special characters) are lemmatized. This may be due to the fact that this is a short document, a sentence of 8 words. Unlike the lemmatization statistics of this example, a proportion of words in any typical text document (i.e., longer than a sentence) will normally remain unresolved - words that the algorithm fails to lemmatize in both stages. Overall and as part of evaluating the proposed method, we have tested the algorithm on 120 documents of various lengths including general news articles, and social media posts. For the news articles, we have used extracts (i.e., title and first 1-2 paragraphs) as well as the full articles to see the effect of document length. The results we found for these different document categories are summarized in Table 2. The notations #Docs, Avg Doc Len, and Avg Acc. in the table respectively represent the number of documents, average document length in words, and average lemmatization accuracy. As shown, the results demonstrate that the algorithm achieves a relatively good accuracy of 57% for moderately long documents (e.g.
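The "percentage of words normalized" statistic discussed in this excerpt can be sketched as follows. This is a hedged illustration, not the authors' code: the two-stage function, the toy English lexicon and rule (standing in for Somali resources), and all names are assumptions.

```python
from typing import Callable, Dict, List, Optional, Set

# Hypothetical toy resources; a real system would use a Somali lexicon and rules.
TOY_LEXICON: Dict[str, str] = {"cats": "cat"}
TOY_STOP_WORDS: Set[str] = {"the"}


def two_stage_lemmatize(word: str) -> Optional[str]:
    """Stage 1: lexicon lookup. Stage 2: a single toy rule. None means unresolved."""
    w = word.lower()
    if w in TOY_LEXICON:
        return TOY_LEXICON[w]
    if w.endswith("ed") and len(w) > 4:
        return w[:-2]
    return None


def normalized_share(tokens: List[str],
                     stop_words: Set[str],
                     lemmatize: Callable[[str], Optional[str]]) -> float:
    """Percentage of content words (no stop words, no special characters) resolved."""
    content = [t for t in tokens if t.isalpha() and t.lower() not in stop_words]
    if not content:
        return 0.0
    resolved = sum(1 for t in content if lemmatize(t) is not None)
    return 100.0 * resolved / len(content)


share = normalized_share(["The", "cats", "walked", "yesterday", "!"],
                         TOY_STOP_WORDS, two_stage_lemmatize)
```

Here "yesterday" is left unresolved by both stages, so two of the three content words count as normalized.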
Hybrid lemmatization in HuSpaCy
Berkecz, Péter, Orosz, György, Szántó, Zsolt, Szabó, Gergő, Farkas, Richárd
Lemmatization is still not a trivial task for morphologically rich languages. Previous studies showed that hybrid architectures usually work better for these languages and can yield great results. This paper presents a hybrid lemmatizer utilizing a neural model, dictionaries and hand-crafted rules. We introduce a hybrid architecture along with empirical results on a widely used Hungarian dataset. The presented methods are published as three HuSpaCy models.
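One way such a hybrid could be wired together is sketched below: a dictionary lookup first, then hand-crafted rules, then a learned model as fallback. The ordering, the function names, and the toy Hungarian entries are assumptions for illustration; the abstract does not specify how HuSpaCy combines its components, and the "neural" fallback here is only a stub.

```python
from typing import Callable, Dict, List, Optional, Tuple

Rule = Callable[[str, str], Optional[str]]


def strip_inessive(token: str, pos: str) -> Optional[str]:
    """Hypothetical rule: drop the Hungarian inessive suffix -ban/-ben from nouns."""
    w = token.lower()
    if pos == "NOUN" and w.endswith(("ban", "ben")) and len(w) > 5:
        return w[:-3]
    return None


def hybrid_lemmatize(token: str,
                     pos: str,
                     lexicon: Dict[Tuple[str, str], str],
                     rules: List[Rule],
                     neural_fallback: Callable[[str, str], str]) -> str:
    """Try the dictionary, then each rule in order, then the model fallback."""
    key = (token.lower(), pos)
    if key in lexicon:
        return lexicon[key]
    for rule in rules:
        candidate = rule(token, pos)
        if candidate is not None:
            return candidate
    return neural_fallback(token, pos)


# Toy resources: one dictionary entry and an identity stub in place of a model.
toy_lexicon = {("házak", "NOUN"): "ház"}
stub_model = lambda token, pos: token.lower()

results = [
    hybrid_lemmatize("házak", "NOUN", toy_lexicon, [strip_inessive], stub_model),
    hybrid_lemmatize("kertben", "NOUN", toy_lexicon, [strip_inessive], stub_model),
    hybrid_lemmatize("Szép", "ADJ", toy_lexicon, [strip_inessive], stub_model),
]
```

Putting the high-precision sources first means the model is only consulted for words the dictionary and rules do not cover; a real system might order or weight the components differently.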
An Open-Source Gloss-Based Baseline for Spoken to Signed Language Translation
Moryossef, Amit, Müller, Mathias, Göhring, Anne, Jiang, Zifan, Goldberg, Yoav, Ebling, Sarah
Sign language translation systems are complex and require many components. As a result, it is very hard to compare methods across publications. We present an open-source implementation of a text-to-gloss-to-pose-to-video pipeline approach, demonstrating conversion from German to Swiss German Sign Language, French to French Sign Language of Switzerland, and Italian to Italian Sign Language of Switzerland. We propose three different components for the text-to-gloss translation: a lemmatizer, a rule-based word reordering and dropping component, and a neural machine translation system. Gloss-to-pose conversion occurs using data from a lexicon for three different signed languages, with skeletal poses extracted from videos. To generate a sentence, the text-to-gloss system is first run, and the pose representations of the resulting signs are stitched together.
LatinCy: Synthetic Trained Pipelines for Latin NLP
This paper introduces LatinCy, a set of trained general purpose Latin-language "core" pipelines for use with the spaCy natural language processing framework (Honnibal and Montani, 2023). These are end-to-end pipelines for taking plaintext Latin as input for basic NLP processing including sentence segmentation, word tokenization, lemmatization, part-of-speech and morphological tagging, dependency parsing, and named entity recognition (NER). Three models have so far been trained, named according to spaCy conventions: la_core_web_sm, la_core_web_md, and la_core_web_lg. To clarify, 'la' refers to the language code for Latin; 'core' refers to a pipeline that includes all of the components named above, including specifically NER; 'web' refers to the nature of the training data, specifically that the model is trained primarily on Universal Dependency treebanks; and 'sm', 'md', and 'lg' refer to the "size" (i.e., small, medium, or large) of the models, with 'md' and 'lg' models being larger because they include subword vectors that describe the vocabulary while 'sm' models do not. The current default pipeline consists of the following spaCy components: 'tagger', 'morphologizer', 'trainable_lemmatizer' (i.e. the EditTreeLemmatizer based on Müller et al., 2015),
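The naming convention described above can be made explicit with a small parser. This is a sketch under assumptions: `PipelineName`, `parse_pipeline_name`, and `includes_vectors` are hypothetical helpers, not part of spaCy or LatinCy, and the vector rule simply restates the abstract's remark about 'md' and 'lg' models.

```python
from typing import NamedTuple


class PipelineName(NamedTuple):
    lang: str   # language code, e.g. 'la' for Latin
    kind: str   # pipeline type, e.g. 'core'
    genre: str  # nature of the training data, e.g. 'web'
    size: str   # 'sm', 'md', or 'lg'


def parse_pipeline_name(name: str) -> PipelineName:
    """Split a name such as 'la_core_web_lg' into its four components."""
    parts = name.split("_")
    if len(parts) != 4:
        raise ValueError(f"expected 4 underscore-separated parts, got {name!r}")
    return PipelineName(*parts)


def includes_vectors(name: str) -> bool:
    """Per the description above, 'md' and 'lg' models ship vectors; 'sm' does not."""
    return parse_pipeline_name(name).size in {"md", "lg"}


parsed = parse_pipeline_name("la_core_web_lg")
```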
Context based lemmatizer for Polish language
Karwatowski, Michal, Pietron, Marcin
Natural Language Processing consists of many tasks, each of which extracts and processes human-understandable meaning from text data. Some tasks, like classification, encompass the complete flow from data to answer; in others, like part-of-speech tagging, the results are often used as input to subsequent algorithms. An interesting and complex problem is translation, where the meaning of the text needs to be extracted and encoded back into text in a different language. This approach describes a family of NLP tasks called text-to-text or sequence-to-sequence processing. Another example of text-to-text processing is lemmatisation, which finds the base form of a given word or expression. The complexity of this problem varies from language to language. In English the number of word variations is usually low, and there are simple rules with few exceptions. However, in Slavic languages such as Polish, word inflection is significantly more complicated, and effective lemmatisation is beyond the capabilities of rule-based or edit-tree classification methods [1], [2]. The situation becomes more difficult when we include multi-segment expressions.
Text Mining in Python: Steps and Examples
In today's world, according to industry estimates, only 20 percent of data is generated in a structured format as we speak, tweet, and send messages on WhatsApp, email, Facebook, Instagram or any other text channel. The majority of this data exists in textual form, which is highly unstructured; to produce meaningful insights from it, we need a method called Text Analysis. Text Mining is the process of deriving meaningful information from natural language text. Natural Language Processing (NLP) is a part of computer science and artificial intelligence which deals with human languages. In other words, NLP is a component of text mining that performs a special kind of linguistic analysis that essentially helps a machine "read" text.