An Evaluation of Sindhi Word Embedding in Semantic Analogies and Downstream Tasks
Ali, Wazir; Tumrani, Saifullah; Kumar, Jay; Soomro, Tariq Rahim
arXiv.org Artificial Intelligence
In this paper, we propose a new corpus for training word embeddings, consisting of more than 61 million words crawled from multiple web resources. We design a preprocessing pipeline to filter unwanted text from the crawled data. The cleaned vocabulary is then fed to the state-of-the-art continuous-bag-of-words, skip-gram, and GloVe word embedding algorithms. To evaluate the pretrained embeddings, we use popular intrinsic and extrinsic evaluation approaches. The results reveal that continuous-bag-of-words and skip-gram perform better than GloVe and the existing Sindhi fastText word embeddings on both intrinsic and extrinsic evaluations.
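The abstract does not say how the semantic-analogy evaluation is scored. As a rough illustration only, the sketch below uses the common vector-offset (3CosAdd) rule: for a question "a is to b as c is to ?", it picks the vocabulary word whose vector is closest in cosine similarity to b - a + c. The function names, the 2-dimensional toy vectors, and the English placeholder words are all invented for this example and are not taken from the paper or its corpus.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def solve_analogy(emb, a, b, c):
    """Return the word closest to b - a + c, excluding the three query words."""
    target = [vb - va + vc for va, vb, vc in zip(emb[a], emb[b], emb[c])]
    candidates = (w for w in emb if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(emb[w], target))


def analogy_accuracy(emb, questions):
    """Fraction of (a, b, c, d) questions answered correctly.

    Questions containing out-of-vocabulary words are skipped (an assumption;
    some evaluation setups count them as errors instead).
    """
    answerable = [q for q in questions if all(w in emb for w in q)]
    if not answerable:
        return 0.0
    correct = sum(solve_analogy(emb, a, b, c) == d for a, b, c, d in answerable)
    return correct / len(answerable)


# Hypothetical toy embeddings; real vectors would come from a trained model.
toy = {
    "man": [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king": [3.0, 0.0],
    "queen": [3.0, 1.0],
    "apple": [0.0, -2.0],
}
questions = [
    ("man", "woman", "king", "queen"),
    ("man", "woman", "king", "missing"),  # skipped: "missing" is out of vocabulary
]
print(analogy_accuracy(toy, questions))  # prints 1.0
```

With real embeddings, `toy` would be replaced by a word-to-vector mapping loaded from each trained model, and the same question set would be scored against each model for comparison.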
Aug-28-2024
- Country:
  - Africa > Chad
    - Salamat (0.04)
  - Asia
    - Afghanistan > Kabul Province
      - Kabul (0.04)
    - Bangladesh > Dhaka Division
      - Dhaka District > Dhaka (0.04)
    - China > Beijing
      - Beijing (0.04)
    - India > Maharashtra
      - Mumbai (0.04)
    - Japan > Honshū
      - Kantō > Tokyo Metropolis Prefecture > Tokyo (0.04)
    - Malaysia > Kuala Lumpur
      - Kuala Lumpur (0.04)
    - Middle East
      - Iran > Tehran Province
        - Tehran (0.04)
      - Iraq > Baghdad Governorate
        - Baghdad (0.04)
      - Saudi Arabia > Riyadh Province
        - Riyadh (0.04)
    - Pakistan
      - Punjab > Lahore Division
        - Lahore (0.04)
      - Sindh > Karachi Division
        - Karachi (0.04)
  - Europe > Germany (0.04)
  - North America
    - Canada > Nova Scotia
      - Halifax Regional Municipality > Halifax (0.04)
    - United States > New York (0.04)
- Genre:
  - Research Report (0.50)
- Technology: