Lexicon and Rule-based Word Lemmatization Approach for the Somali Language

Shafie Abdi Mohamed, Muhidin Abdullahi Mohamed

arXiv.org Artificial Intelligence 

The lemmatization summary statistics for the Example 3 sentence are also provided in Table 1. In this case, the percentage of normalized words reached 100%, meaning that all content words (excluding stop words and special characters) were lemmatized. This is likely because the document is short, a sentence of only 8 words. Unlike in this example, a proportion of words in any typical text document (i.e., one longer than a sentence) will normally remain unresolved: words that the algorithm fails to lemmatize in both stages.

Overall, as part of evaluating the proposed method, we tested the algorithm on 120 documents of various lengths, including general news articles and social media posts. For the news articles, we used extracts (i.e., the title and first 1-2 paragraphs) as well as the full articles to examine the effect of document length. The results for these different document categories are summarized in Table 2. The notations #Docs, Avg Doc Len, and Avg Acc. in the table represent, respectively, the number of documents, the average document length in words, and the average lemmatization accuracy. As shown, the results demonstrate that the algorithm achieves a relatively good accuracy of 57% for moderately long documents (e.g.
