Kurdish


KurdSTS: The Kurdish Semantic Textual Similarity

Abdullah, Abdulhady Abas, Veisi, Hadi, Al, Hussein M.

arXiv.org Artificial Intelligence

Semantic Textual Similarity (STS) measures the degree of semantic equivalence between two texts and is important in many Natural Language Processing tasks. While extensive resources have been developed for high-resource languages, low-resource languages such as Kurdish have been neglected. This paper introduces the first STS dataset for Kurdish, aiming to alleviate this gap. The dataset contains 10,000 formal and informal sentence pairs annotated for similarity. We benchmark several models, including Sentence Bidirectional Encoder Representations from Transformers (Sentence-BERT) and multilingual Bidirectional Encoder Representations from Transformers (multilingual BERT), among others, which achieve promising results while also showcasing the difficulties presented by the distinctive nature of Kurdish. This work paves the way for future studies in Kurdish semantic research and, more broadly, in Natural Language Processing for other low-resource languages.
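As a rough illustration of how embedding models such as Sentence-BERT are typically used to score an STS pair: each sentence is mapped to a vector and the pair is scored by cosine similarity, often rescaled to the 0-5 range common in STS gold annotations. The sketch below uses toy placeholder vectors in place of real model outputs; it is not the paper's evaluation code.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy placeholder embeddings standing in for Sentence-BERT sentence vectors.
emb_a = [0.2, 0.7, 0.1]
emb_b = [0.25, 0.6, 0.15]

sim = cosine_similarity(emb_a, emb_b)
# STS gold labels are often on a 0-5 scale; rescale cosine ([-1, 1]) to match.
sts_score = (sim + 1) / 2 * 5
```

Model quality is then usually reported as the Pearson or Spearman correlation between these predicted scores and the human annotations.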


Subword Tokenization Strategies for Kurdish Word Embeddings

Salehi, Ali, Jacobs, Cassandra L.

arXiv.org Artificial Intelligence

We investigate tokenization strategies for Kurdish word embeddings by comparing word-level, morpheme-based, and BPE approaches on morphological similarity preservation tasks. We develop a BiLSTM-CRF morphological segmenter using bootstrapped training from minimal manual annotation and evaluate Word2Vec embeddings across comprehensive metrics including similarity preservation, clustering quality, and semantic organization. Our analysis reveals critical evaluation biases in tokenization comparison. While BPE initially appears superior in morphological similarity, it evaluates only 28.6% of test cases compared to 68.7% for the morpheme-based model, creating artificial performance inflation. When assessed comprehensively, morpheme-based tokenization demonstrates superior embedding space organization, better semantic neighborhood structure, and more balanced coverage across morphological complexity levels. These findings highlight the importance of coverage-aware evaluation in low-resource language processing and offer practical guidance on tokenization choices for low-resource language processing.
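The coverage bias the abstract describes can be sketched in a few lines: if one model is scored only on the cases it covers, a low-coverage model can look artificially strong. The numbers below are hypothetical, not the paper's results; the point is the gap between accuracy-on-covered and a coverage-aware score where uncovered items count as failures.

```python
def evaluate(predictions, gold):
    """predictions: dict test_id -> label; items absent from it are 'not covered'."""
    covered = [i for i in gold if i in predictions]
    correct = sum(1 for i in covered if predictions[i] == gold[i])
    coverage = len(covered) / len(gold)
    acc_on_covered = correct / len(covered) if covered else 0.0
    # Coverage-aware score: uncovered items count as failures.
    acc_overall = correct / len(gold)
    return coverage, acc_on_covered, acc_overall

gold = {i: 1 for i in range(10)}
# Hypothetical: "BPE" covers 3 items, all correct; "morpheme" covers 7, 6 correct.
bpe = {0: 1, 1: 1, 2: 1}
morpheme = {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 0}

bpe_cov, bpe_acc, bpe_overall = evaluate(bpe, gold)
mor_cov, mor_acc, mor_overall = evaluate(morpheme, gold)
```

Here the BPE model wins on accuracy-over-covered-cases yet loses decisively once coverage is taken into account, mirroring the evaluation bias reported in the abstract.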


From Dialect Gaps to Identity Maps: Tackling Variability in Speaker Verification

Abdullah, Abdulhady Abas, Badawi, Soran, Abdullah, Dana A., Hamad, Dana Rasul

arXiv.org Artificial Intelligence

This work investigates the complexity and difficulties of Kurdish speaker identification across its several dialects. Because of its great phonetic and lexical differences, Kurdish, with dialects including Kurmanji, Sorani, and Hawrami, poses special challenges for speaker recognition systems. We examine the main difficulties in building a robust speaker identification system capable of accurately identifying speakers across dialects, and to raise the accuracy and dependability of such systems we propose solutions including sophisticated machine learning approaches, data augmentation tactics, and the construction of comprehensive dialect-specific corpora. The results show that strategies customized for each dialect, together with cross-dialect training, greatly enhance recognition performance.


KuBERT: Central Kurdish BERT Model and Its Application for Sentiment Analysis

Awlla, Kozhin muhealddin, Veisi, Hadi, Abdullah, Abdulhady Abas

arXiv.org Artificial Intelligence

This paper enhances the study of sentiment analysis for the Central Kurdish language by integrating Bidirectional Encoder Representations from Transformers (BERT) into Natural Language Processing techniques. Kurdish is a low-resource language with a high level of linguistic diversity and minimal computational resources, making sentiment analysis challenging. Earlier work used traditional word embedding models such as Word2Vec, but with the emergence of new language models, specifically BERT, there is hope for improvement. BERT's superior word embedding capabilities aid this study in capturing the nuanced semantics and contextual intricacies of the Kurdish language, setting a new benchmark for sentiment analysis in low-resource languages. The steps include collecting and normalizing a large corpus of Kurdish texts, pretraining BERT with a special tokenizer for Kurdish, and developing different models for sentiment analysis, including a Bidirectional Long Short-Term Memory (BiLSTM) network, a Multi-Layer Perceptron (MLP), and a fine-tuned BERT classifier. The proposed approach covers three classes (positive, negative, and neutral) using BERT sentiment embeddings in four different configurations. The accuracy of the best-performing classifier, BiLSTM, is 74.09%. For the BERT with an MLP classifier model, the maximum accuracy achieved is 73.96%, while the fine-tuned BERT model tops the others with 75.37% accuracy. Additionally, the fine-tuned BERT model demonstrates a vast improvement when focused on two-class sentiment analysis (positive and negative), with an accuracy of 86.


The Role of Orthographic Consistency in Multilingual Embedding Models for Text Classification in Arabic-Script Languages

Abdullah, Abdulhady Abas, Gandomi, Amir H., Rashid, Tarik A, Mirjalili, Seyedali, Abualigah, Laith, Živković, Milena, Veisi, Hadi

arXiv.org Artificial Intelligence

In natural language processing, multilingual models like mBERT and XLM-RoBERTa promise broad coverage but often struggle with languages that share a script yet differ in orthographic norms and cultural context. This issue is especially notable in Arabic-script languages such as Kurdish Sorani, Arabic, Persian, and Urdu. We introduce the Arabic Script RoBERTa (AS-RoBERTa) family: four RoBERTa-based models, each pre-trained on a large corpus tailored to its specific language. By focusing pre-training on language-specific script features and statistics, our models capture patterns overlooked by general-purpose models. When fine-tuned on classification tasks, AS-RoBERTa variants outperform mBERT and XLM-RoBERTa by 2 to 5 percentage points. An ablation study confirms that script-focused pre-training is central to these gains. Error analysis using confusion matrices shows how shared script traits and domain-specific content affect performance. Our results highlight the value of script-aware specialization for languages using the Arabic script and support further work on pre-training strategies rooted in script and language specificity.
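The error analysis the abstract mentions rests on a standard confusion matrix: rows are gold labels, columns are predicted labels, and off-diagonal cells expose which script-sharing languages get confused. A minimal sketch with hypothetical labels and toy predictions (not the paper's data or code):

```python
from collections import Counter

def confusion_matrix(gold, pred, labels):
    """Rows = gold label, columns = predicted label."""
    counts = Counter(zip(gold, pred))
    return [[counts[(g, p)] for p in labels] for g in labels]

# Hypothetical labels for an Arabic-script classification task.
labels = ["sorani", "arabic", "persian", "urdu"]
gold = ["sorani", "sorani", "arabic", "persian", "urdu", "persian"]
pred = ["sorani", "arabic", "arabic", "persian", "urdu", "urdu"]

cm = confusion_matrix(gold, pred, labels)
# Off-diagonal cells (e.g. cm[0][1]: Sorani predicted as Arabic) reveal
# confusions driven by shared script traits.
```

In the toy data above, the one Sorani-as-Arabic and one Persian-as-Urdu error show up directly as nonzero off-diagonal cells.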


Speaker Diarization for Low-Resource Languages Through Wav2vec Fine-Tuning

Abdullah, Abdulhady Abas, Karim, Sarkhel H. Taher, Ahmed, Sara Azad, Tariq, Kanar R., Rashid, Tarik A.

arXiv.org Artificial Intelligence

Speaker diarization, a core problem in speech processing, entails partitioning a given audio stream according to the speakers. Even though progress has been made in developing models for high-resource languages, a set of specific difficulties remains for low-resource languages such as Kurdish: very few annotated datasets are available, the language has multiple dialects, and speakers code-switch frequently. These challenges are met in this study by training the Wav2Vec 2.0 self-supervised model on a Kurdish dataset prepared for this purpose. Thanks to transfer learning, multilingual representations learned from other languages could be transferred to the phonetic and acoustic features of Kurdish speech. The overall Diarization Error Rate (DER) was reduced by 7.2%, and cluster purity increased by 13% compared to the baseline algorithm. These results show that improvements to a state-of-the-art model can enhance performance for under-resourced languages. Implications of this work include transcription services for Kurdish-language media programs, as well as speaker segmentation in multilingual call centers, teleconferencing, and videoconferencing systems. This work therefore demonstrates that self-supervised and transfer techniques can improve speaker diarization for Kurdish and other low-resource languages with diverse features. The approach provides a base for building effective diarization systems in other understudied languages, which remains essential for equity in speech technology.
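DER, the headline metric here, sums false-alarm, missed-speech, and speaker-confusion time over the reference speech duration. Below is a simplified frame-level sketch; real DER scoring (e.g. the NIST tool) additionally finds an optimal speaker-label mapping and applies a forgiveness collar, which this toy version omits.

```python
def der(reference, hypothesis):
    """Simplified frame-level Diarization Error Rate.
    reference/hypothesis: per-frame speaker labels, None = silence.
    Assumes hypothesis labels are already mapped to reference labels."""
    scored = [r for r in reference if r is not None]
    errors = 0
    for r, h in zip(reference, hypothesis):
        if r is None and h is not None:
            errors += 1          # false alarm
        elif r is not None and h is None:
            errors += 1          # missed speech
        elif r is not None and r != h:
            errors += 1          # speaker confusion
    return errors / len(scored)

ref = ["A", "A", "B", "B", None, "B"]
hyp = ["A", "B", "B", "B", "A", None]
rate = der(ref, hyp)  # one confusion + one false alarm + one miss over 5 speech frames
```

So a reported DER reduction of 7.2% corresponds to fewer total mis-attributed frames relative to the reference speech time.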


Idiom Detection in Sorani Kurdish Texts

Omer, Skala Kamaran, Hassani, Hossein

arXiv.org Artificial Intelligence

Idiom detection using Natural Language Processing (NLP) is the computerized process of recognizing figurative expressions within a text that convey meanings beyond the literal interpretation of the words. While idiom detection has seen significant progress across various languages, the Kurdish language faces a considerable research gap in this area despite the importance of idioms in tasks like machine translation and sentiment analysis. This study addresses idiom detection in Sorani Kurdish by approaching it as a text classification task using deep learning techniques. To tackle this, we developed a dataset containing 10,580 sentences embedding 101 Sorani Kurdish idioms across diverse contexts. Using this dataset, we developed and evaluated three deep learning models: KuBERT-based transformer sequence classification, a Recurrent Convolutional Neural Network (RCNN), and a BiLSTM model with an attention mechanism. The evaluations revealed that the transformer model, the fine-tuned BERT, consistently outperformed the others, achieving nearly 99% accuracy while the RCNN achieved 96.5% and the BiLSTM 80%. These results highlight the effectiveness of Transformer-based architectures in low-resource languages like Kurdish. This research provides a dataset, three optimized models, and insights into idiom detection, laying a foundation for advancing Kurdish NLP.


End-to-End Transformer-based Automatic Speech Recognition for Northern Kurdish: A Pioneering Approach

Abdullah, Abdulhady Abas, Tabibian, Shima, Veisi, Hadi, Mahmudi, Aso, Rashid, Tarik

arXiv.org Artificial Intelligence

Automatic Speech Recognition (ASR) for low-resource languages remains a challenging task due to limited training data. This paper introduces a comprehensive study exploring the effectiveness of Whisper, a pre-trained ASR model, for Northern Kurdish (Kurmanji), an under-resourced language spoken in the Middle East. We investigate three fine-tuning strategies: vanilla, specific parameters, and additional modules. Using a Northern Kurdish speech corpus for fine-tuning containing approximately 68 hours of validated transcribed data, our experiments demonstrate that the additional-module fine-tuning strategy significantly improves ASR accuracy on a specialized test set, achieving a Word Error Rate (WER) of 10.5% and a Character Error Rate (CER) of 5.7% with Whisper version 3. These results underscore the potential of sophisticated transformer models for low-resource ASR and emphasize the importance of tailored fine-tuning techniques for optimal performance.
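The reported WER and CER are both Levenshtein-distance metrics: WER counts word-level insertions, deletions, and substitutions against the reference transcript; CER does the same over characters. A minimal sketch (not the paper's evaluation code):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[m][n]

def wer(ref_words, hyp_words):
    """Word Error Rate: word edits normalized by reference length."""
    return edit_distance(ref_words, hyp_words) / len(ref_words)

def cer(ref_text, hyp_text):
    """Character Error Rate: character edits normalized by reference length."""
    return edit_distance(list(ref_text), list(hyp_text)) / len(ref_text)
```

For example, `wer("the cat sat".split(), "the cat sit".split())` gives one substitution over three reference words, i.e. roughly 33% WER.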


Shifting from endangerment to rebirth in the Artificial Intelligence Age: An Ensemble Machine Learning Approach for Hawrami Text Classification

Khaksar, Aram, Hassani, Hossein

arXiv.org Artificial Intelligence

Hawrami, a dialect of Kurdish, is classified as an endangered language as it suffers from the scarcity of data and the gradual loss of its speakers. Natural Language Processing projects can partially compensate for data availability in endangered languages and dialects through a variety of approaches, such as machine translation, language model building, and corpora development. Similarly, NLP projects such as text classification play a role in language documentation. Several text classification studies have been conducted for Kurdish, but they were mainly dedicated to two particular dialects: Sorani (Central Kurdish) and Kurmanji (Northern Kurdish). In this paper, we introduce various text classification models using a dataset of 6,854 articles in Hawrami labeled into 15 categories by two native speakers. We use K-Nearest Neighbor (KNN), Linear Support Vector Machine (Linear SVM), Logistic Regression (LR), and Decision Tree (DT) classifiers to evaluate how well those methods perform the classification task. The results indicate that the Linear SVM achieves 96% accuracy and outperforms the other approaches.
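Of the four classifiers compared, KNN is the simplest to sketch end to end: represent each article as a bag-of-words vector, find the k training articles most similar to the query, and take a majority vote. The tiny training set below uses English placeholder tokens purely for illustration; it is not the Hawrami dataset or the paper's pipeline.

```python
from collections import Counter
import math

def bow(text):
    """Bag-of-words term counts."""
    return Counter(text.split())

def cosine(c1, c2):
    dot = sum(c1[w] * c2[w] for w in c1)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def knn_predict(train, text, k=3):
    """train: list of (text, label). Majority vote over the k nearest by cosine."""
    query = bow(text)
    ranked = sorted(train, key=lambda tl: cosine(bow(tl[0]), query), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical mini training set (English placeholders for Hawrami articles).
train = [
    ("football match goal team", "sports"),
    ("league player score win", "sports"),
    ("election vote parliament law", "politics"),
    ("minister government policy debate", "politics"),
]
pred = knn_predict(train, "team player win match", k=3)
```

A Linear SVM, the paper's best performer, would instead learn a weight vector over the same bag-of-words features, which typically generalizes better than raw nearest-neighbor voting on sparse text.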


Enhancing Kurdish Text-to-Speech with Native Corpus Training: A High-Quality WaveGlow Vocoder Approach

Abdullah, Abdulhady Abas, Muhamad, Sabat Salih, Veisi, Hadi

arXiv.org Artificial Intelligence

The ability to synthesize spoken language from text has greatly facilitated access to digital content with the advances in text-to-speech (TTS) technology. However, effective TTS development for low-resource languages, such as Central Kurdish (CKB), still faces many challenges, due mainly to the lack of linguistic information and dedicated resources. In this paper, we improve the Tacotron-based Kurdish TTS system by training a Kurdish WaveGlow vocoder on a 21-hour Central Kurdish speech corpus instead of using a WaveGlow vocoder pre-trained on English. Training the vocoder on a target-language corpus is required to accurately and fluently capture the phonetic and prosodic characteristics of Kurdish. With these enhancements, our model significantly outperforms the baseline system built on English pretrained models. In particular, our adapted WaveGlow model achieves an impressive MOS of 4.91, which sets a new benchmark for Kurdish speech synthesis. On one hand, this study advances the capabilities of TTS for Central Kurdish, and on the other, it opens the door to further development for other Kurdish dialects and related languages.