An Evaluation of Sindhi Word Embedding in Semantic Analogies and Downstream Tasks

Ali, Wazir, Tumrani, Saifullah, Kumar, Jay, Soomro, Tariq Rahim

Aug-28-2024–arXiv.org Artificial Intelligence

In this paper, we present a new corpus for training word embeddings, consisting of more than 61 million words crawled from multiple web resources. We design a preprocessing pipeline to filter unwanted text from the crawled data. The cleaned vocabulary is then fed to the state-of-the-art continuous-bag-of-words, skip-gram, and GloVe word embedding algorithms. We evaluate the pretrained embeddings with popular intrinsic and extrinsic approaches. The results show that continuous-bag-of-words and skip-gram perform better than GloVe and the existing Sindhi fastText word embeddings on both intrinsic and extrinsic evaluations.
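As an illustration of the intrinsic evaluation mentioned above, a semantic analogy test asks whether vector arithmetic on the embeddings recovers a held-out word ("a is to b as c is to ?"). The following is a minimal sketch of that idea under stated assumptions, not the authors' evaluation code: the function names `cosine` and `solve_analogy`, the additive offset scoring, and the three-dimensional toy vectors with English placeholder words are all hypothetical and chosen only to make the example self-contained.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def solve_analogy(emb, a, b, c):
    """Answer 'a is to b as c is to ?' with the additive offset b - a + c.

    The three query words are excluded from the candidates, as is
    common practice in analogy benchmarks.
    """
    target = [vb - va + vc for va, vb, vc in zip(emb[a], emb[b], emb[c])]
    candidates = (w for w in emb if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(emb[w], target))


# Hypothetical toy vectors; real embeddings would come from a trained
# continuous-bag-of-words, skip-gram, or GloVe model.
toy = {
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "apple": [0.5, 0.5, 0.5],
}

answer = solve_analogy(toy, "man", "king", "woman")
print(answer)
```

Analogy accuracy over a test set would then be the fraction of questions for which the returned word matches the expected one.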

corpus, machine learning, natural language

Country:
- Europe > Germany (0.04)
- North America
  - United States > New York (0.04)
  - Canada > Nova Scotia
    - Halifax Regional Municipality > Halifax (0.04)
- Asia
  - Pakistan
    - Sindh > Karachi Division
      - Karachi (0.04)
    - Punjab > Lahore Division
      - Lahore (0.04)
  - Middle East
    - Saudi Arabia > Riyadh Province
      - Riyadh (0.04)
    - Iraq > Baghdad Governorate
      - Baghdad (0.04)
    - Iran > Tehran Province
      - Tehran (0.04)
  - Malaysia > Kuala Lumpur
    - Kuala Lumpur (0.04)
  - Japan > Honshū
    - Kantō > Tokyo Metropolis Prefecture > Tokyo (0.04)
  - India > Maharashtra
    - Mumbai (0.04)
  - China > Beijing
    - Beijing (0.04)
  - Bangladesh > Dhaka Division
    - Dhaka District > Dhaka (0.04)
  - Afghanistan > Kabul Province
    - Kabul (0.04)
- Africa > Chad
  - Salamat (0.04)

Genre:
- Research Report (0.50)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Text Processing (1.00)
  - Machine Learning > Neural Networks (1.00)
