AITopics | morphological analysis

Tokens with Meaning: A Hybrid Tokenization Approach for NLP

Bayram, M. Ali, Fincan, Ali Arda, Gümüş, Ahmet Semih, Karakaş, Sercan, Diri, Banu, Yıldırım, Savaş, Çelik, Demircan

arXiv.org Artificial Intelligence, Aug-21-2025

Tokenization plays a pivotal role in natural language processing (NLP), shaping how text is segmented and interpreted by language models. While subword methods such as Byte Pair Encoding (BPE) and WordPiece have been effective, they often struggle with morphologically rich and agglutinative languages because they rely on frequency rather than linguistic structure. We introduce a hybrid tokenization framework that combines rule-based morphological analysis with statistical subword segmentation. The method uses phonological normalization, root-affix dictionaries, and a novel algorithm that balances morpheme preservation with vocabulary efficiency. It assigns shared identifiers to phonologically variant affixes (e.g., -ler and -lar) and altered root forms (e.g., kitap vs. kitabı), reducing redundancy while maintaining semantic integrity. Special tokens are added for whitespace and case, including an UPPERCASE marker to avoid vocabulary inflation from capitalization. BPE is integrated for out-of-vocabulary coverage without harming morphological coherence. On the TR-MMLU benchmark, the tokenizer achieves the highest Turkish Token Percentage (90.29%) and Pure Token Percentage (85.8%). Comparisons with tokenizers from LLaMA, Gemma, and GPT show more linguistically meaningful and coherent tokens. Although demonstrated on Turkish, the approach is language-independent and adaptable to other languages, offering a practical path toward more interpretable and effective multilingual NLP systems.
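
The idea of shared identifiers for allomorphs and an uppercase marker can be illustrated with a minimal Python sketch. This is not the authors' implementation: the dictionaries, the numeric IDs, and the names `ROOT_IDS`, `AFFIX_IDS`, `UPPERCASE_ID`, and `UNKNOWN_ID` are hypothetical, and the BPE fallback described in the abstract is replaced here by a single placeholder ID.

```python
# Toy illustration only: dictionaries, IDs, and names are hypothetical
# and are not taken from the paper.

ROOT_IDS = {"kitap": 100, "kitab": 100, "ev": 101}        # altered root forms share an ID
AFFIX_IDS = {"lar": 200, "ler": 200, "ı": 201, "i": 201}  # affix variants share an ID
UPPERCASE_ID = 1  # marker emitted before a capitalized word
UNKNOWN_ID = 0    # placeholder for the statistical subword (BPE) fallback


def longest_prefix(text, table):
    """Return the longest prefix of `text` that is a key of `table`, or None."""
    for end in range(len(text), 0, -1):
        if text[:end] in table:
            return text[:end]
    return None


def tokenize_word(word):
    """Map a word to IDs: optional case marker, one root, then affixes."""
    ids = []
    if word[:1].isupper():
        ids.append(UPPERCASE_ID)
        # Note: str.lower() is not Turkish-aware (dotted/dotless i);
        # a real tokenizer would need locale-specific case folding.
        word = word.lower()
    root = longest_prefix(word, ROOT_IDS)
    if root is None:
        return ids + [UNKNOWN_ID]
    ids.append(ROOT_IDS[root])
    rest = word[len(root):]
    while rest:
        affix = longest_prefix(rest, AFFIX_IDS)
        if affix is None:
            ids.append(UNKNOWN_ID)  # unanalysed remainder goes to the fallback
            break
        ids.append(AFFIX_IDS[affix])
        rest = rest[len(affix):]
    return ids
```

In this toy setup, `kitaplar` and `evler` end in the same affix ID even though their surface suffixes differ, and `kitap` and `kitabı` start with the same root ID.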

large language model, machine learning, natural language, (22 more...)

arXiv.org Artificial Intelligence

2508.14292

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.93)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.70)
(2 more...)

GliLem: Leveraging GliNER for Contextualized Lemmatization in Estonian

Dorkin, Aleksei, Sirts, Kairit

arXiv.org Artificial Intelligence, Jan-11-2025

We present GliLem, a novel hybrid lemmatization system for Estonian that enhances the highly accurate rule-based morphological analyzer Vabamorf with an external disambiguation module based on GliNER, an open-vocabulary NER model that is able to match text spans with text labels in natural language. We leverage the flexibility of a pre-trained GliNER model to improve the lemmatization accuracy of Vabamorf by 10% compared to its original disambiguation module and achieve an improvement over the token classification-based baseline. To measure the impact of improvements in lemmatization accuracy on the information retrieval downstream task, we first created an information …

Effective lemmatization enhances various downstream NLP tasks, including information retrieval based on lexical search and text analysis. Although dense vector retrieval is gaining traction in information retrieval, lexical search methods remain highly relevant, particularly in modern hybrid systems. Lexical search excels as a first-stage retriever due to its efficiency with inverted indices, and provides reliable exact term matching that dense retrievers may miss (Gao et al., 2021). Recent research demonstrates that lexical and dense retrieval are complementary, lexical matching providing a strong foundation for precise word-level matches, while dense retrieval captures semantic relationships and handles vocabulary mismatches. The complementary nature of these approaches has led to state-of-the-art hybrid systems that outperform either method alone (Lee et al., 2023).

artificial intelligence, information retrieval, natural language, (18 more...)

arXiv.org Artificial Intelligence

2412.20597

Country:

Europe (1.00)
North America > Mexico (0.28)

Genre: Research Report > New Finding (0.48)

Technology: Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)

Model editing for distribution shifts in uranium oxide morphological analysis

Brown, Davis, Nizinski, Cody, Shapiro, Madelyn, Fallon, Corey, Yin, Tianzhixi, Kvinge, Henry, Tu, Jonathan H.

arXiv.org Artificial Intelligence, Jul-22-2024

Deep learning still struggles with certain kinds of scientific data. Notably, pretraining data may not provide coverage of relevant distribution shifts (e.g., shifts induced via the use of different measurement instruments). We consider deep learning models trained to classify the synthesis conditions of uranium ore concentrates (UOCs) and show that model editing is particularly effective for improving generalization to distribution shifts common in this domain. In particular, model editing outperforms finetuning on two curated datasets comprising micrographs taken of U₃O₈ aged in humidity chambers and micrographs acquired with different scanning electron microscopes, respectively.

dataset, distribution shift, editing, (17 more...)

arXiv.org Artificial Intelligence

2407.15756

Country:

North America > United States > New Mexico > Bernalillo County > Albuquerque (0.04)
Europe > Croatia > Dubrovnik-Neretva County > Dubrovnik (0.04)
Asia > China > Liaoning Province > Shenyang (0.04)

Genre: Research Report (0.64)

Industry:

Energy (0.47)
Government > Regional Government (0.46)
Materials (0.35)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.89)

Open-Source Web Service with Morphological Dictionary-Supplemented Deep Learning for Morphosyntactic Analysis of Czech

Straka, Milan, Straková, Jana

arXiv.org Artificial Intelligence, Jun-18-2024

We present an open-source web service for Czech morphosyntactic analysis. The system combines a deep learning model with rescoring by a high-precision morphological dictionary at inference time. We show that our hybrid method surpasses two competitive baselines: the deep learning model ensures generalization for out-of-vocabulary words and better disambiguation, an improvement over the existing morphological analyser MorphoDiTa, while at the same time benefiting from inference-time guidance of a manually curated morphological dictionary. We achieve 50% error reduction in lemmatization and 58% error reduction in POS tagging over MorphoDiTa, while also offering dependency parsing. The model is trained on one of the currently largest Czech morphosyntactic corpora, the PDT-C 1.0, with the trained models available at https://hdl.handle.net/11234/1-5293. We provide the tool as a web service deployed at https://lindat.mff.cuni.
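
One simple way to picture inference-time guidance by a dictionary is to let the dictionary filter the model's candidate analyses. The Python sketch below is only an assumption about how such rescoring could look, since the abstract does not spell out the mechanism: the function `rescore`, the data layout, the scores, and the coarse tags are all hypothetical.

```python
def rescore(word, model_scores, dictionary):
    """Pick the best analysis, preferring analyses the dictionary licenses.

    model_scores: dict mapping (lemma, tag) -> model score (hypothetical values).
    dictionary:   dict mapping word form -> set of allowed (lemma, tag) pairs.
    """
    allowed = dictionary.get(word)
    if allowed:
        # Keep only the model's candidates that the dictionary also lists.
        licensed = {a: s for a, s in model_scores.items() if a in allowed}
        if licensed:
            return max(licensed, key=licensed.get)
    # Out-of-vocabulary word (or no overlap): rely on the model alone.
    return max(model_scores, key=model_scores.get)


# Toy dictionary: the Czech form "stát" can be a noun or a verb.
toy_dictionary = {"stát": {("stát", "NOUN"), ("stát", "VERB")}}

known = rescore(
    "stát",
    {("stát", "ADJ"): 0.5, ("stát", "NOUN"): 0.3, ("stát", "VERB"): 0.2},
    toy_dictionary,
)
unknown = rescore(
    "googlit",
    {("googlit", "VERB"): 0.9, ("googlit", "NOUN"): 0.1},
    toy_dictionary,
)
```

Here the dictionary overrides the model's top-scored but unlicensed `ADJ` reading for the known word, while the out-of-vocabulary word is decided by the model alone, which mirrors the division of labour the abstract describes.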

error reduction, pdt-c 1, udpipe 2, (12 more...)

arXiv.org Artificial Intelligence

2406.12422

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Czechia > Prague (0.06)
Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
(4 more...)

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

UzMorphAnalyser: A Morphological Analysis Model for the Uzbek Language Using Inflectional Endings

Salaev, Ulugbek

arXiv.org Artificial Intelligence, Jun-12-2024

Uzbek is an agglutinative language with many morphological features, in which words are formed by combining a root and affixes. Affixes play an important role in the morphological analysis of words, adding meanings and grammatical functions to them. Inflectional endings are used to express various morphological features within the language. This introduces numerous possibilities for word endings, thereby significantly expanding the vocabulary and exacerbating data sparsity in statistical models. This paper presents a model for the morphological analysis of Uzbek words, including stemming, lemmatizing, and the extraction of morphological information while considering morpho-phonetic exceptions. The main steps of the model involve developing a complete set of word endings with assigned morphological information, and additional datasets for morphological analysis. The proposed model was evaluated using a curated test set comprising 5.3K words. Through manual verification of stemming, lemmatizing, and morphological feature corrections carried out by linguistic specialists, it obtained a word-level accuracy of over 91%. The developed tool based on the proposed model is available as a web-based application and an open-source Python library.
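
The general idea of analysing a word by looking up its ending in a table of endings with attached morphological information can be sketched in Python as follows. This is not the UzMorphAnalyser API: the table `ENDINGS`, the feature labels, the function `analyse`, and the `min_stem` parameter are hypothetical, and the sketch ignores the morpho-phonetic exceptions the paper handles.

```python
# Hypothetical mini-table of (possibly composite) endings and their features.
ENDINGS = {
    "larimda": {"Number": "Plur", "Poss": "1Sg", "Case": "Loc"},
    "lar": {"Number": "Plur"},
    "da": {"Case": "Loc"},
    "ni": {"Case": "Acc"},
}


def analyse(word, min_stem=2):
    """Strip the longest known ending; return (stem, features).

    `min_stem` guards against stripping an ending from a very short word.
    """
    for ending in sorted(ENDINGS, key=len, reverse=True):
        if word.endswith(ending) and len(word) - len(ending) >= min_stem:
            return word[: -len(ending)], ENDINGS[ending]
    return word, {}  # no known ending: treat the whole word as the stem


stem, features = analyse("kitoblarimda")  # kitob + lar + im + da
```

Trying the longest endings first means a composite ending such as `-larimda` is preferred over its final piece `-da`. A pure lookup like this will also mis-strip roots that merely happen to end in a listed ending, which is one reason exception lists are needed in practice.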

dataset, morphological analysis, uzbek language, (10 more...)

arXiv.org Artificial Intelligence

2405.14179

Country:

Asia > Uzbekistan (0.05)
Europe > Switzerland (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.47)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.30)

Recent advancements in computational morphology: A comprehensive survey

Baxi, Jatayu, Bhatt, Brijesh

arXiv.org Artificial Intelligence, Jun-8-2024

Computational morphology handles language processing at the word level. It is one of the foundational tasks in the NLP pipeline for the development of higher-level NLP applications. It mainly deals with the processing of words and word forms. Computational morphology addresses various subproblems such as morpheme boundary detection, lemmatization, morphological feature tagging, and morphological reinflection. In this paper, we present an exhaustive survey of the methods for developing computational morphology tools. We survey the literature in chronological order, from conventional methods to the recent evolution of deep neural network based approaches. We also review the existing datasets available for this task across languages. We discuss the effectiveness of neural models compared with traditional models and present some unique challenges associated with building computational morphology tools. We conclude by discussing some recent and open research issues in this field.

computational linguistic, morphology, proceedings, (13 more...)

arXiv.org Artificial Intelligence

2406.05424

Country:

North America > United States > Texas > Travis County > Austin (0.14)
North America > United States > Washington > King County > Seattle (0.04)
North America > Canada > Ontario > Toronto (0.04)
(32 more...)

Genre:

Research Report (1.00)
Overview (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(2 more...)

Can Chat GPT solve a Linguistics Exam?

Ronan, Patricia, Schneider, Gerold

arXiv.org Artificial Intelligence, Nov-4-2023

The present study asks if ChatGPT4, the version of ChatGPT which uses the language model GPT4, can successfully solve introductory linguistics exams. Previous exam questions of an Introduction to Linguistics course at a German university are used to test this. The exam questions were fed into ChatGPT4 with only minimal preprocessing. The results show that the language model is very successful in the interpretation even of complex and nested tasks. It proved surprisingly successful in the task of broad phonetic transcription, but performed less well in the analysis of morphemes and phrases. In simple cases it performs sufficiently well, but rarer cases, particularly with missing one-to-one correspondence, are currently treated with mixed results. The model is not yet able to deal with visualisations, such as the analysis or generation of syntax trees. More extensive preprocessing, which translates these tasks into text data, allows the model to also solve these tasks successfully.

chatgpt, exam, language model, (15 more...)

arXiv.org Artificial Intelligence

2311.02499

Country:

Europe > United Kingdom (0.04)
Europe > Switzerland > Zürich > Zürich (0.04)
Europe > Germany (0.04)

Genre: Research Report > New Finding (0.48)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Noor-Ghateh: A Benchmark Dataset for Evaluating Arabic Word Segmenters in Hadith Domain

AlShuhayeb, Huda, Minaei-Bidgoli, Behrouz, Shenassa, Mohammad E., Hossayni, Sayyed-Ali

arXiv.org Artificial Intelligence, Jun-22-2023

The Arabic language has many complex and rich morphological subtleties, which are very useful when analyzing traditional Arabic texts, especially in historical and religious contexts, and help in understanding the meaning of the texts. Vocabulary separation means separating a word into different parts such as root and affix. In morphological datasets, the variety of labels and the number of data samples help to evaluate morphological methods. In this paper, we present a benchmark dataset for evaluating methods of separating Arabic words, which includes about 223,690 words from the book of Sharia alIslam, labeled by experts. In terms of the volume and variety of words, this dataset is superior to other existing datasets and, as far as we know, no other dataset covers Arabic Hadith-domain texts. To evaluate the dataset, we applied different methods such as Farasa, Camel, Madamira, and ALP to it and reported the annotation quality through four evaluation methods.

artificial intelligence, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2307.0963

Country:

Europe > Czechia > Prague (0.05)
Asia > Middle East > Iran > Tehran Province > Tehran (0.05)
North America > United States > Pennsylvania (0.04)
(6 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.69)

Back to Patterns: Efficient Japanese Morphological Analysis with Feature-Sequence Trie

Yoshinaga, Naoki

arXiv.org Artificial Intelligence, May-30-2023

Accurate neural models are much less efficient than non-neural models and are useless for processing billions of social media posts or handling user queries in real time with a limited budget. This study revisits the fastest pattern-based NLP methods to make them as accurate as possible, thus yielding a strikingly simple yet surprisingly accurate morphological analyzer for Japanese. The proposed method induces reliable patterns from a morphological dictionary and annotated data. Experimental results on two standard datasets confirm that the method exhibits comparable accuracy to learning-based baselines, while boasting a remarkable throughput of over 1,000,000 sentences per second on a single modern CPU. The source code is available at https://www.tkl.iis.u-tokyo.ac.jp/~ynaga/jagger/
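
Pattern-based analyzers of this kind rely on fast longest-match lookups in a trie. The Python sketch below shows only that basic building block, a plain dictionary trie with greedy longest-match segmentation. It is a deliberately simplified stand-in, not the feature-sequence trie of the paper, and the word list and function names are hypothetical.

```python
END = ""  # end-of-word key; never collides with a real character


def build_trie(words):
    """Build a nested-dict trie from an iterable of dictionary words."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def segment(text, trie):
    """Greedy longest-match segmentation; unknown characters become
    single-character tokens."""
    tokens, i = [], 0
    while i < len(text):
        node, longest = trie, 0
        for j in range(i, len(text)):
            node = node.get(text[j])
            if node is None:
                break
            if END in node:
                longest = j - i + 1  # remember the longest word seen so far
        step = longest or 1
        tokens.append(text[i:i + step])
        i += step
    return tokens


toy_trie = build_trie(["東京", "東京都", "に", "住む"])
tokens = segment("東京都に住む", toy_trie)
```

Each position is resolved with a single walk down the trie, which is what makes this family of methods cheap. Greedy matching alone is not accurate enough for real text, which is why the paper induces richer patterns from a dictionary and annotated data.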

artificial intelligence, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2305.19045

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.25)
Asia > Japan > Honshū > Kansai > Kyoto Prefecture > Kyoto (0.07)
Europe > Belgium > Brussels-Capital Region > Brussels (0.04)
(13 more...)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.74)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Morpheme Boundary Detection & Grammatical Feature Prediction for Gujarati: Dataset & Model

Baxi, Jatayu, Bhatt, Dr. Brijesh

arXiv.org Artificial Intelligence, Dec-18-2021

Developing Natural Language Processing resources for a low resource language is a challenging but essential task. In this paper, we present a Morphological Analyzer for Gujarati. We have used a Bi-Directional LSTM based approach to perform morpheme boundary detection and grammatical feature tagging. We have created a data set of Gujarati words with lemma and grammatical features. The Bi-LSTM based model of Morph Analyzer discussed in the paper handles the language morphology effectively without the knowledge of any hand-crafted suffix rules. To the best of our knowledge, this is the first dataset and morph analyzer model for the Gujarati language which performs both grammatical feature tagging and morpheme boundary detection tasks.

analyzer, gujarati, proceedings, (15 more...)

arXiv.org Artificial Intelligence

2112.0986

Country:

Asia > India > Gujarat (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
(8 more...)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
