QuranMorph: Morphologically Annotated Quranic Corpus
Akra, Diyam; Hammouda, Tymaa; Jarrar, Mustafa
arXiv.org Artificial Intelligence
We present the QuranMorph corpus, a morphologically annotated corpus of the Quran (77,429 tokens). Each token in QuranMorph was manually lemmatized and tagged with its part of speech by three expert linguists. The lemmatization process used lemmas from Qabas, an Arabic lexicographic database linked with 110 lexicons and corpora of 2 million tokens. The part-of-speech tagging was performed using the fine-grained SAMA/Qabas tagset, which encompasses 40 tags. As shown in this paper, this rich lemmatization and POS tagset enabled the QuranMorph corpus to be interlinked with many linguistic resources. The corpus is open-source and publicly available as part of the SinaLab resources at https://sina.birzeit.edu/quran.
Jun-24-2025