 morphological analysis


Tokens with Meaning: A Hybrid Tokenization Approach for NLP

Bayram, M. Ali, Fincan, Ali Arda, Gümüş, Ahmet Semih, Karakaş, Sercan, Diri, Banu, Yıldırım, Savaş, Çelik, Demircan

arXiv.org Artificial Intelligence

Tokenization plays a pivotal role in natural language processing (NLP), shaping how text is segmented and interpreted by language models. While subword methods such as Byte Pair Encoding (BPE) and WordPiece have been effective, they often struggle with morphologically rich and agglutinative languages because they rely on frequency rather than linguistic structure. We introduce a hybrid tokenization framework that combines rule-based morphological analysis with statistical subword segmentation. The method uses phonological normalization, root-affix dictionaries, and a novel algorithm that balances morpheme preservation with vocabulary efficiency. It assigns shared identifiers to phonologically variant affixes (e.g., -ler and -lar) and altered root forms (e.g., kitap vs. kitabı), reducing redundancy while maintaining semantic integrity. Special tokens are added for whitespace and case, including an UPPERCASE marker to avoid vocabulary inflation from capitalization. BPE is integrated for out-of-vocabulary coverage without harming morphological coherence. On the TR-MMLU benchmark, the tokenizer achieves the highest Turkish Token Percentage (90.29%) and Pure Token Percentage (85.8%). Comparisons with tokenizers from LLaMA, Gemma, and GPT show more linguistically meaningful and coherent tokens. Although demonstrated on Turkish, the approach is language-independent and adaptable to other languages, offering a practical path toward more interpretable and effective multilingual NLP systems.
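As a rough illustration of the shared-identifier idea described in this abstract, the sketch below maps variant affixes and altered root forms to the same ID and emits a case marker instead of storing capitalized forms. The dictionaries, ID values, and the `tokenize_word` function are hypothetical and chosen only for this example; they are not the paper's resources or algorithm, and the BPE fallback is represented simply by returning `None`.

```python
# Toy dictionaries (assumptions for illustration, not the paper's data).
ROOT_IDS = {"kitap": 1, "kitab": 1, "ev": 2}               # altered root forms share an ID
AFFIX_IDS = {"lar": 100, "ler": 100, "ı": 101, "i": 101}   # vowel-harmony variants share an ID
UPPERCASE_ID = 0                                           # hypothetical case-marker token


def tokenize_word(word):
    """Greedy root + affix segmentation.

    Returns a list of IDs, or None when the dictionaries do not cover the
    word (where a real system would fall back to a subword method such as BPE).
    """
    ids = []
    if word[:1].isupper():
        # Emit a marker and lowercase, so "Evler" and "evler" share tokens.
        ids.append(UPPERCASE_ID)
        word = word.lower()

    # Longest root that is a prefix of the word.
    root = max((r for r in ROOT_IDS if word.startswith(r)), key=len, default=None)
    if root is None:
        return None
    ids.append(ROOT_IDS[root])

    # Consume the remainder affix by affix, longest match first.
    rest = word[len(root):]
    while rest:
        affix = max((a for a in AFFIX_IDS if rest.startswith(a)), key=len, default=None)
        if affix is None:
            return None
        ids.append(AFFIX_IDS[affix])
        rest = rest[len(affix):]
    return ids
```

With these toy tables, `tokenize_word("kitaplar")` and `tokenize_word("kitabı")` both start with the same root ID, and `-lar` and `-ler` resolve to one affix ID, which is the redundancy reduction the abstract refers to.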


GliLem: Leveraging GliNER for Contextualized Lemmatization in Estonian

Dorkin, Aleksei, Sirts, Kairit

arXiv.org Artificial Intelligence

We present GliLem, a novel hybrid lemmatization system for Estonian that enhances the highly accurate rule-based morphological analyzer Vabamorf with an external disambiguation module based on GliNER, an open vocabulary NER model that is able to match text spans with text labels in natural language. We leverage the flexibility of a pre-trained GliNER model to improve the lemmatization accuracy of Vabamorf by 10% compared to its original disambiguation module and achieve an improvement over the token classification-based baseline. To measure the impact of improvements in lemmatization accuracy on the information retrieval downstream task, we first created an information ...

Effective lemmatization enhances various downstream NLP tasks, including information retrieval based on lexical search and text analysis. Although dense vector retrieval is gaining traction in information retrieval, lexical search methods remain highly relevant, particularly in modern hybrid systems. Lexical search excels as a first-stage retriever due to its efficiency with inverted indices, and provides reliable exact term matching that dense retrievers may miss (Gao et al., 2021). Recent research demonstrates that lexical and dense retrieval are complementary, lexical matching providing a strong foundation for precise word-level matches, while dense retrieval captures semantic relationships and handles vocabulary mismatches. The complementary nature of these approaches has led to state-of-the-art hybrid systems that outperform either method alone (Lee et al., 2023).


Model editing for distribution shifts in uranium oxide morphological analysis

Brown, Davis, Nizinski, Cody, Shapiro, Madelyn, Fallon, Corey, Yin, Tianzhixi, Kvinge, Henry, Tu, Jonathan H.

arXiv.org Artificial Intelligence

Deep learning still struggles with certain kinds of scientific data. Notably, pretraining data may not provide coverage of relevant distribution shifts (e.g., shifts induced via the use of different measurement instruments). We consider deep learning models trained to classify the synthesis conditions of uranium ore concentrates (UOCs) and show that model editing is particularly effective for improving generalization to distribution shifts common in this domain. In particular, model editing outperforms finetuning on two curated datasets comprising micrographs taken of U3O8 aged in humidity chambers and micrographs acquired with different scanning electron microscopes, respectively.


Open-Source Web Service with Morphological Dictionary-Supplemented Deep Learning for Morphosyntactic Analysis of Czech

Straka, Milan, Straková, Jana

arXiv.org Artificial Intelligence

We present an open-source web service for Czech morphosyntactic analysis. The system combines a deep learning model with rescoring by a high-precision morphological dictionary at inference time. We show that our hybrid method surpasses two competitive baselines. The deep learning model ensures generalization for out-of-vocabulary words and better disambiguation, an improvement over the existing morphological analyser MorphoDiTa; at the same time, the deep learning model benefits from inference-time guidance by a manually curated morphological dictionary. We achieve 50% error reduction in lemmatization and 58% error reduction in POS tagging over MorphoDiTa, while also offering dependency parsing. The model is trained on one of the largest current Czech morphosyntactic corpora, the PDT-C 1.0, with the trained models available at https://hdl.handle.net/11234/1-5293. We provide the tool as a web service deployed at https://lindat.mff.cuni.
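One simple way to picture inference-time dictionary guidance is to restrict the model's candidate tags to those the dictionary allows for a known word form, and to trust the model alone for out-of-vocabulary forms. The sketch below is only an assumption about how such rescoring could look; the `rescore` function, the score dictionaries, and the tag names are hypothetical, and the abstract does not specify the actual rescoring procedure.

```python
def rescore(form, model_scores, dictionary):
    """Pick a tag for `form` from model scores, guided by a dictionary.

    model_scores: dict mapping tag -> model score (higher is better).
    dictionary:   dict mapping word form -> set of tags it permits.
    All names here are illustrative assumptions.
    """
    allowed = dictionary.get(form)  # None for out-of-vocabulary forms
    candidates = {
        tag: score
        for tag, score in model_scores.items()
        if allowed is None or tag in allowed
    }
    if not candidates:
        # The dictionary permits none of the model's tags: fall back to the model.
        candidates = model_scores
    return max(candidates, key=candidates.get)
```

For a form listed in the dictionary with a single permitted tag, that tag wins even if the model scored another tag higher; for an unlisted form, the result is the model's highest-scoring tag.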


UzMorphAnalyser: A Morphological Analysis Model for the Uzbek Language Using Inflectional Endings

Salaev, Ulugbek

arXiv.org Artificial Intelligence

Because Uzbek is an agglutinative language, it has many morphological features, with words formed by combining roots and affixes. Affixes play an important role in the morphological analysis of words by adding meanings and grammatical functions to them. Inflectional endings are used to express various morphological features within the language. This introduces numerous possibilities for word endings, significantly expanding the vocabulary and exacerbating data sparsity in statistical models. This paper presents a model for the morphological analysis of Uzbek words, including stemming, lemmatizing, and the extraction of morphological information while considering morpho-phonetic exceptions. The main steps of the model involve developing a complete set of word endings with assigned morphological information, and additional datasets for morphological analysis. The proposed model was evaluated using a curated test set comprising 5.3K words. Through manual verification of stemming, lemmatizing, and morphological feature corrections carried out by linguistic specialists, it obtained a word-level accuracy of over 91%. The developed tool based on the proposed model is available as a web-based application and an open-source Python library.
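The ending-based idea can be sketched as repeated longest-suffix matching against a table of endings with attached morphological information. The table, the feature labels, the `min_stem` guard, and the `analyse` function below are assumptions made for this example; they are not the UzMorphAnalyser data or API, and the sketch ignores the morpho-phonetic exceptions the abstract mentions.

```python
# Tiny illustrative ending table (assumption): ending -> morphological feature.
ENDINGS = {
    "lar": "Number=Plur",
    "ni": "Case=Acc",
    "ning": "Case=Gen",
    "da": "Case=Loc",
}


def analyse(word, min_stem=3):
    """Strip known endings from the right, longest match first.

    Returns (stem, features), with features in left-to-right order.
    `min_stem` is a crude guard against stripping into the root.
    """
    stem = word
    feats = []
    while True:
        match = max(
            (e for e in ENDINGS
             if stem.endswith(e) and len(stem) - len(e) >= min_stem),
            key=len,
            default=None,
        )
        if match is None:
            break
        feats.insert(0, ENDINGS[match])  # endings are found right to left
        stem = stem[:-len(match)]
    return stem, feats
```

For instance, `analyse("kitoblarni")` strips `-ni` and then `-lar`, leaving the stem `kitob` with plural and accusative features.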


Recent advancements in computational morphology: A comprehensive survey

Baxi, Jatayu, Bhatt, Brijesh

arXiv.org Artificial Intelligence

Computational morphology handles language processing at the word level. It is one of the foundational tasks in the NLP pipeline for the development of higher-level NLP applications. It mainly deals with the processing of words and word forms. Computational morphology addresses various subproblems such as morpheme boundary detection, lemmatization, morphological feature tagging, and morphological reinflection. In this paper, we present an exhaustive survey of the methods for developing computational morphology tools. We survey the literature in chronological order, from conventional methods to the recent evolution of deep neural network based approaches. We also review the existing datasets available for this task across languages. We discuss the effectiveness of neural models compared with traditional models and present some unique challenges associated with building computational morphology tools. We conclude by discussing some recent and open research issues in this field.


Can Chat GPT solve a Linguistics Exam?

Ronan, Patricia, Schneider, Gerold

arXiv.org Artificial Intelligence

The present study asks if ChatGPT4, the version of ChatGPT which uses the language model GPT4, can successfully solve introductory linguistic exams. Previous exam questions of an Introduction to Linguistics course at a German university are used to test this. The exam questions were fed into ChatGPT4 with only minimal preprocessing. The results show that the language model is very successful in the interpretation even of complex and nested tasks. It proved surprisingly successful in the task of broad phonetic transcription, but performed less well in the analysis of morphemes and phrases. In simple cases it performs sufficiently well, but rarer cases, particularly with missing one-to-one correspondence, are currently treated with mixed results. The model is not yet able to deal with visualisations, such as the analysis or generation of syntax trees. More extensive preprocessing, which translates these tasks into text data, allows the model to also solve these tasks successfully.


Noor-Ghateh: A Benchmark Dataset for Evaluating Arabic Word Segmenters in Hadith Domain

AlShuhayeb, Huda, Minaei-Bidgoli, Behrouz, Shenassa, Mohammad E., Hossayni, Sayyed-Ali

arXiv.org Artificial Intelligence

The Arabic language has many complex and rich morphological subtleties, which are very useful when analyzing traditional Arabic texts, especially in historical and religious contexts, and which help in understanding the meaning of the texts. Word segmentation means separating a word into parts such as root and affix. In morphological datasets, the variety of labels and the number of data samples help to evaluate morphological methods. In this paper, we present a benchmark dataset for evaluating Arabic word segmentation methods; it includes about 223,690 words from the book of Sharia alIslam, which have been labeled by experts. In terms of the volume and variety of words, this dataset is superior to other existing datasets, and as far as we know, no such dataset exists for Arabic Hadith-domain texts. To evaluate the dataset, we applied different methods such as Farasa, Camel, Madamira, and ALP to the dataset and reported the annotation quality through four evaluation methods.


Back to Patterns: Efficient Japanese Morphological Analysis with Feature-Sequence Trie

Yoshinaga, Naoki

arXiv.org Artificial Intelligence

Accurate neural models are much less efficient than non-neural models and are useless for processing billions of social media posts or handling user queries in real time with a limited budget. This study revisits the fastest pattern-based NLP methods to make them as accurate as possible, thus yielding a strikingly simple yet surprisingly accurate morphological analyzer for Japanese. The proposed method induces reliable patterns from a morphological dictionary and annotated data. Experimental results on two standard datasets confirm that the method exhibits comparable accuracy to learning-based baselines, while boasting a remarkable throughput of over 1,000,000 sentences per second on a single modern CPU. The source code is available at https://www.tkl.iis.u-tokyo.ac.jp/~ynaga/jagger/


Morpheme Boundary Detection & Grammatical Feature Prediction for Gujarati: Dataset & Model

Baxi, Jatayu, Bhatt, Dr. Brijesh

arXiv.org Artificial Intelligence

Developing Natural Language Processing resources for a low resource language is a challenging but essential task. In this paper, we present a Morphological Analyzer for Gujarati. We have used a Bi-Directional LSTM based approach to perform morpheme boundary detection and grammatical feature tagging. We have created a data set of Gujarati words with lemma and grammatical features. The Bi-LSTM based model of Morph Analyzer discussed in the paper handles the language morphology effectively without the knowledge of any hand-crafted suffix rules. To the best of our knowledge, this is the first dataset and morph analyzer model for the Gujarati language which performs both grammatical feature tagging and morpheme boundary detection tasks.