AITopics | lemmatizer

lemmatizer

A Case Study of Cross-Lingual Zero-Shot Generalization for Classical Languages in LLMs

Akavarapu, V. S. D. S. Mahesh, Terdalkar, Hrishikesh, Bhattacharyya, Pramit, Agarwal, Shubhangi, Deulgaonkar, Vishakha, Manna, Pralay, Dangarikar, Chaitali, Bhattacharya, Arnab

arXiv.org Artificial Intelligence, Jun-3-2025

Large Language Models (LLMs) have demonstrated remarkable generalization capabilities across diverse tasks and languages. In this study, we focus on natural language understanding in three classical languages -- Sanskrit, Ancient Greek and Latin -- to investigate the factors affecting cross-lingual zero-shot generalization. First, we explore named entity recognition and machine translation into English. While LLMs perform equal to or better than fine-tuned baselines on out-of-domain data, smaller models often struggle, especially with niche or abstract entity types. In addition, we concentrate on Sanskrit by presenting a factoid question-answering (QA) dataset and show that incorporating context via a retrieval-augmented generation approach significantly boosts performance. In contrast, we observe pronounced performance drops for smaller LLMs across these QA tasks. These results suggest model scale as an important factor influencing cross-lingual generalization. Assuming that the models used, such as GPT-4o and Llama-3.1, are not instruction fine-tuned on classical languages, our findings provide insights into how LLMs may generalize on these languages and their consequent utility in classical studies.

computational linguistic, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2505.13173

Country:

Europe (1.00)
Asia > Middle East (0.67)
North America > United States > Minnesota (0.28)

Genre: Research Report > New Finding (1.00)

Industry: Health & Medicine (0.69)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

A Simple Joint Model for Improved Contextual Neural Lemmatization

Malaviya, Chaitanya, Wu, Shijie, Cotterell, Ryan

arXiv.org Artificial Intelligence, May-28-2024

English verbs have multiple forms. For instance, talk may also appear as talks, talked or talking, depending on the context. The NLP task of lemmatization seeks to map these diverse forms back to a canonical one, known as the lemma. We present a simple joint neural model for lemmatization and morphological tagging that achieves state-of-the-art results on 20 languages from the Universal Dependencies corpora. Our paper describes the model in addition to training and decoding procedures. Error analysis indicates that joint morphological tagging and lemmatization is especially helpful in low-resource lemmatization and languages that display a larger degree of morphological ...

[Figure 1: Our structured neural model shown as a hybrid (directed-undirected) graphical model (Koller and Friedman, 2009).]
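The mapping from talks, talked and talking back to talk can be illustrated with a deliberately naive suffix stripper. This is only a toy sketch of what the task asks for, not the joint neural model the abstract describes; the function name and the suffix list are assumptions made for illustration.

```python
def toy_lemmatize(word):
    """Strip a few regular English verb suffixes (illustrative only)."""
    for suffix in ("ing", "ed", "s"):
        # Require at least three remaining characters so that short
        # words such as "is" are left untouched.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


forms = ["talk", "talks", "talked", "talking"]
lemmas = [toy_lemmatize(w) for w in forms]
print(lemmas)
```

A rule like this breaks on irregular forms (for example, went) and on context-dependent cases, which is one reason contextual models are used instead.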

computational linguistic, lemmatization, linguistic, (14 more...)

arXiv.org Artificial Intelligence

1904.02306

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
North America > United States > Washington > King County > Seattle (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
(7 more...)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

BanLemma: A Word Formation Dependent Rule and Dictionary Based Bangla Lemmatizer

Afrin, Sadia, Chowdhury, Md. Shahad Mahmud, Islam, Md. Ekramul, Khan, Faisal Ahamed, Chowdhury, Labib Imam, Mahtab, MD. Motahar, Chowdhury, Nazifa Nuha, Forkan, Massud, Kundu, Neelima, Arif, Hakim, Rashid, Mohammad Mamun Or, Amin, Mohammad Ruhul, Mohammed, Nabeel

arXiv.org Artificial Intelligence, Nov-6-2023

Lemmatization holds significance in both natural language processing (NLP) and linguistics, as it effectively decreases data density and aids in comprehending contextual meaning. However, due to the highly inflected nature and morphological richness, lemmatization in Bangla text poses a complex challenge. In this study, we propose linguistic rules for lemmatization and utilize a dictionary along with the rules to design a lemmatizer specifically for Bangla. Our system aims to lemmatize words based on their parts of speech class within a given sentence. Unlike previous rule-based approaches, we analyzed the suffix marker occurrence according to the morpho-syntactic values and then utilized sequences of suffix markers instead of entire suffixes. To develop our rules, we analyze a large corpus of Bangla text from various domains, sources, and time periods to observe the word formation of inflected words. The lemmatizer achieves an accuracy of 96.36% when tested against a manually annotated test dataset by trained linguists and demonstrates competitive performance on three previously published Bangla lemmatization datasets. We are making the code and datasets publicly available at https://github.com/eblict-gigatech/BanLemma in order to contribute to the further advancement of Bangla NLP.
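The general idea of combining a dictionary with part-of-speech-dependent sequences of suffix markers might be sketched as follows. This is not the BanLemma implementation: the function name, the English toy markers and the tiny dictionary are all hypothetical stand-ins chosen so the example is easy to follow.

```python
def strip_marker_sequence(word, pos, markers_by_pos, dictionary):
    """Peel suffix markers off one at a time until a dictionary word appears.

    markers_by_pos maps a part-of-speech label to the markers allowed for it
    (hypothetical data; real rules would come from linguistic analysis).
    """
    if word in dictionary:
        return word
    current = word
    stripped = True
    while stripped:
        stripped = False
        for marker in markers_by_pos.get(pos, ()):
            if current.endswith(marker) and len(current) > len(marker):
                current = current[: -len(marker)]
                stripped = True
                if current in dictionary:
                    return current
                break  # restart the marker scan on the shortened word
    return word  # no dictionary form reached: leave the word unchanged


# Toy English data standing in for real markers and a real lexicon.
dictionary = {"book", "walk"}
markers_by_pos = {"NOUN": ("'", "s"), "VERB": ("ing", "ed", "s")}

print(strip_marker_sequence("books'", "NOUN", markers_by_pos, dictionary))
print(strip_marker_sequence("booked", "VERB", markers_by_pos, dictionary))
print(strip_marker_sequence("walked", "NOUN", markers_by_pos, dictionary))
```

Because the markers are selected by part of speech, the verb ending in walked is not stripped when the word is tagged as a noun, which mirrors the abstract's point that the rules depend on the word's class within the sentence.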

dataset, lemma, lemmatizer, (17 more...)

arXiv.org Artificial Intelligence

2311.03078

Country:

Asia > Bangladesh > Dhaka Division > Dhaka District > Dhaka (0.04)
Asia > Indonesia > Bali (0.04)
Europe > Belgium > Brussels-Capital Region > Brussels (0.04)
(7 more...)

Genre: Research Report > New Finding (0.66)

Industry: Media > News (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Rule-Based Reasoning (0.89)

On the Role of Morphological Information for Contextual Lemmatization

Toporkov, Olia, Agerri, Rodrigo

arXiv.org Artificial Intelligence, Oct-20-2023

Lemmatization is a natural language processing (NLP) task which consists of producing, from a given inflected word, its canonical form or lemma. Lemmatization is one of the basic tasks that facilitate downstream NLP applications, and is of particular importance for high-inflected languages. Given that the process to obtain a lemma from an inflected word can be explained by looking at its morphosyntactic category, including fine-grained morphosyntactic information to train contextual lemmatizers has become common practice, without considering whether that is the optimum in terms of downstream performance. In order to address this issue, in this paper we empirically investigate the role of morphological information to develop contextual lemmatizers in six languages within a varied spectrum of morphological complexity: Basque, Turkish, Russian, Czech, Spanish and English. Furthermore, and unlike the vast majority of previous work, we also evaluate lemmatizers in out-of-domain settings, which constitutes, after all, their most common application use. The results of our study are rather surprising. It turns out that providing lemmatizers with fine-grained morphological features during training is not that beneficial, not even for agglutinative languages. In fact, modern contextual word representations seem to implicitly encode enough morphological information to obtain competitive contextual lemmatizers without seeing any explicit morphological signal. Moreover, our experiments suggest that the best lemmatizers out-of-domain are those using simple UPOS tags or those trained without morphology and, finally, that current evaluation practices for lemmatization are not adequate to clearly discriminate between models.

computational linguistic, lemmatization, lemmatizer, (12 more...)

arXiv.org Artificial Intelligence

2302.00407

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Italy > Tuscany > Florence (0.04)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
(29 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.68)

Lexicon and Rule-based Word Lemmatization Approach for the Somali Language

Mohamed, Shafie Abdi, Mohamed, Muhidin Abdullahi

arXiv.org Artificial Intelligence, Aug-3-2023

The lemmatization summary statistics of the Example 3 sentence are also provided in Table 1. In this case, the percentage of words that were normalized for the example reached 100%, which means that all content words (excluding stop words and special characters) are lemmatized. This may be due to the fact that this is a short document, a sentence of 8 words. Unlike the lemmatization statistics of this example, a proportion of words in any typical text document (i.e., longer than a sentence) will normally remain unresolved - words that the algorithm fails to lemmatize in both stages. Overall and as part of evaluating the proposed method, we have tested the algorithm on 120 documents of various lengths including general news articles, and social media posts. For the news articles, we have used extracts (i.e., title and first 1-2 paragraphs) as well as the full articles to see the effect of document length. The results we found for these different document categories are summarized in Table 2. The notations #Docs, Avg Doc Len, and Avg Acc. in the table respectively represent the number of documents, average document length in words, and average lemmatization accuracy. As shown, the results demonstrate that the algorithm achieves a relatively good accuracy of 57% for moderately long documents (e.g.
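The "percentage of words that were normalized" statistic can be sketched with a two-stage pass that skips stop words, tries a lexicon lookup, then falls back to suffix rules, and records what remains unresolved. The stage order, the function name and the English toy data are assumptions for illustration; the paper's actual lexicon and rules are not shown in this excerpt.

```python
def lemmatize_two_stage(tokens, lexicon, suffix_rules, stop_words):
    """Return (resolved pairs, unresolved tokens, percent normalized)."""
    resolved, unresolved = [], []
    for tok in tokens:
        if tok in stop_words:
            continue  # stop words are excluded from the statistics
        if tok in lexicon:  # stage 1: lexicon lookup
            resolved.append((tok, lexicon[tok]))
            continue
        for suffix, replacement in suffix_rules:  # stage 2: suffix rules
            if tok.endswith(suffix) and len(tok) > len(suffix):
                resolved.append((tok, tok[: -len(suffix)] + replacement))
                break
        else:
            unresolved.append(tok)  # failed in both stages
    content_words = len(resolved) + len(unresolved)
    percent = 100.0 * len(resolved) / content_words if content_words else 0.0
    return resolved, unresolved, percent


tokens = ["the", "children", "played", "quietly", "outside"]
resolved, unresolved, percent = lemmatize_two_stage(
    tokens,
    lexicon={"children": "child"},
    suffix_rules=[("ed", "")],
    stop_words={"the"},
)
print(resolved, unresolved, percent)
```

In this toy run two of the four content words are resolved, so the normalized share is 50%; longer documents tend to contain more words that neither stage covers.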

lemmatization, lexicon, root word, (14 more...)

arXiv.org Artificial Intelligence

2308.01785

Country:

North America > United States > Washington > King County > Seattle (0.04)
Europe > United Kingdom > England > West Midlands > Birmingham (0.04)
Africa > Middle East > Somalia > Banaadir > Mogadishu (0.04)
(3 more...)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Rule-Based Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Hybrid lemmatization in HuSpaCy

Berkecz, Péter, Orosz, György, Szántó, Zsolt, Szabó, Gergő, Farkas, Richárd

arXiv.org Artificial Intelligence, Jun-13-2023

Lemmatization is still not a trivial task for morphologically rich languages. Previous studies showed that hybrid architectures usually work better for these languages and can yield great results. This paper presents a hybrid lemmatizer utilizing a neural model, dictionaries and hand-crafted rules. We introduce a hybrid architecture along with empirical results on a widely used Hungarian dataset. The presented methods are published as three HuSpaCy models.

lemmatizer, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2306.07636

Country:

Europe > Hungary > Csongrád-Csanád County > Szeged (0.08)
Europe > Bulgaria (0.04)
Europe > Belgium > Brussels-Capital Region > Brussels (0.04)
(3 more...)

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.47)

An Open-Source Gloss-Based Baseline for Spoken to Signed Language Translation

Moryossef, Amit, Müller, Mathias, Göhring, Anne, Jiang, Zifan, Goldberg, Yoav, Ebling, Sarah

arXiv.org Artificial Intelligence, May-28-2023

Sign language translation systems are complex and require many components. As a result, it is very hard to compare methods across publications. We present an open-source implementation of a text-to-gloss-to-pose-to-video pipeline approach, demonstrating conversion from German to Swiss German Sign Language, French to French Sign Language of Switzerland, and Italian to Italian Sign Language of Switzerland. We propose three different components for the text-to-gloss translation: a lemmatizer, a rule-based word reordering and dropping component, and a neural machine translation system. Gloss-to-pose conversion occurs using data from a lexicon for three different signed languages, with skeletal poses extracted from videos. To generate a sentence, the text-to-gloss system is first run, and the pose representations of the resulting signs are stitched together.

artificial intelligence, natural language, translation, (15 more...)

arXiv.org Artificial Intelligence

2305.17714

Country:

South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
Oceania > Australia > Victoria > Melbourne (0.04)
North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
(11 more...)

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

LatinCy: Synthetic Trained Pipelines for Latin NLP

Burns, Patrick J.

arXiv.org Artificial Intelligence, May-7-2023

This paper introduces LatinCy, a set of trained general purpose Latin-language "core" pipelines for use with the spaCy natural language processing framework (Honnibal and Montani, 2023). These are end-to-end pipelines for taking plaintext Latin as input for basic NLP processing including sentence segmentation, word tokenization, lemmatization, part-of-speech and morphological tagging, dependency parsing, and named entity recognition (NER). Three models have so far been trained, named according to spaCy conventions: la_core_web_sm, la_core_web_md, and la_core_web_lg. To clarify, 'la' refers to the language code for Latin, 'core' refers to a pipeline that includes all of the components named above, including specifically NER; 'web' refers to the nature of the training data, specifically that the model is trained primarily on Universal Dependency treebanks; and 'sm', 'md', and 'lg' refer to the "size"--i.e., small, medium, or large--of the models, with 'md' and 'lg' models being larger because they include subword vectors that describe the vocabulary while 'sm' models do not. The current default pipeline consists of the following spaCy components: 'tagger', 'morphologizer', 'trainable_lemmatizer' (i.e. the EditTreeLemmatizer based on Müller et al., 2015),
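The naming convention described above can be made concrete by splitting a pipeline name into its four parts. The helper below is a hypothetical illustration, not part of spaCy or LatinCy; the has_vectors flag simply encodes the abstract's statement that 'md' and 'lg' models include vectors while 'sm' models do not.

```python
def parse_pipeline_name(name):
    """Split a name such as 'la_core_web_lg' into its conventional parts."""
    parts = name.split("_")
    if len(parts) != 4:
        raise ValueError(f"expected lang_type_genre_size, got {name!r}")
    lang, kind, genre, size = parts
    return {
        "lang": lang,    # language code, e.g. 'la' for Latin
        "type": kind,    # 'core': pipeline with the full component set
        "genre": genre,  # label for the nature of the training data
        "size": size,    # 'sm', 'md' or 'lg'
        "has_vectors": size in {"md", "lg"},
    }


for model in ("la_core_web_sm", "la_core_web_md", "la_core_web_lg"):
    print(parse_pipeline_name(model))
```

With one of these pipelines installed, it would normally be loaded through spaCy's standard spacy.load call using the same name.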

artificial intelligence, natural language, pipeline, (18 more...)

arXiv.org Artificial Intelligence

2305.04365

Country: North America > United States > New York (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)

Context based lemmatizer for Polish language

Karwatowski, Michal, Pietron, Marcin

arXiv.org Artificial Intelligence, Jul-23-2022

Natural Language Processing consists of many tasks, each of which extracts and processes human-understandable meaning from text data. Some tasks, like classification, encompass the complete flow from data to answer; in other tasks, like part-of-speech tagging, results are often used as input for subsequent algorithms. An interesting and complex problem is translation, where the meaning of the text needs to be extracted and encoded back into text in a different language. This approach describes a family of NLP tasks called text-to-text or sequence-to-sequence processing. Another example of text-to-text processing is lemmatisation, which finds the base form of a given word or expression. The complexity of this problem varies from language to language. In English the number of word variations is usually low, and there are simple rules and not many exceptions. However, in Slavic languages such as Polish, the inflection of words is significantly more complicated, and effective lemmatisation is beyond the capabilities of rule-based or edit-tree classification methods [1], [2]. The situation becomes more difficult when we include multi-segment expressions.
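To make the classification-style approach mentioned above more concrete, the sketch below derives a suffix edit rule from a (form, lemma) pair and reapplies it to another word. This is a simplified relative of edit trees, not the method of the paper or of references [1], [2]; the function names are assumptions.

```python
def suffix_rule(form, lemma):
    """Return (suffix to strip, suffix to add) turning form into lemma."""
    i = 0
    while i < min(len(form), len(lemma)) and form[i] == lemma[i]:
        i += 1  # length of the shared prefix
    return form[i:], lemma[i:]


def apply_rule(form, rule):
    """Apply a (strip, add) rule; fail if the form lacks the suffix."""
    strip, add = rule
    if strip and not form.endswith(strip):
        raise ValueError(f"{form!r} does not end with {strip!r}")
    return form[: len(form) - len(strip)] + add


regular = suffix_rule("kotami", "kot")   # instrumental plural of "kot"
irregular = suffix_rule("psa", "pies")   # genitive singular of "pies"
print(regular, apply_rule("domami", regular))
print(irregular)
```

The regular rule transfers to other nouns such as domami, but the rule learned from psa is specific to stem alternations like the one in pies. A classifier over such rules needs many classes for a highly inflected language, which is consistent with the abstract's argument for sequence-to-sequence models.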

computational linguistic, lemmatizer, proceedings, (10 more...)

arXiv.org Artificial Intelligence

2207.11565

Country:

Europe > Poland > Lesser Poland Province > Kraków (0.05)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
Europe > Portugal > Lisbon > Lisbon (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)

Text Mining in Python: Steps and Examples

#artificialintelligence, Aug-22-2019, 18:54:25 GMT

According to industry estimates, only 20 percent of the data being generated today is in a structured format; the rest is produced as we speak, tweet, and send messages on WhatsApp, email, Facebook, Instagram, or by text. The majority of this data exists in textual form, which is highly unstructured, so producing meaningful insights from it requires a method called text analysis. Text mining is the process of deriving meaningful information from natural language text. Natural Language Processing (NLP) is a part of computer science and artificial intelligence which deals with human languages. In other words, NLP is a component of text mining that performs a special kind of linguistic analysis that essentially helps a machine "read" text.

artificial intelligence, data mining, natural language, (11 more...)

#artificialintelligence

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Data Science > Data Mining > Text Mining (0.85)
