Ax-to-Grind Urdu: Benchmark Dataset for Urdu Fake News Detection

Harris, Sheetal, Liu, Jinshuo, Hadi, Hassan Jalil, Cao, Yue

Mar-20-2024–arXiv.org Artificial Intelligence

Abstract: Misinformation can seriously impact society, affecting anything from public opinion to institutional confidence and the political horizon of a state. The proliferation of Fake News (FN) on websites and Online Social Networks (OSNs) has increased sharply. Various fact-checking websites cover news in English but provide barely any information about FN in regional languages, so purveyors of Urdu FN cannot be identified through fact-checking portals. Fake News Detection (FND) in regional and resource-constrained languages lags because of limited-sized datasets and a lack of legitimate lexical resources. Previous datasets for Urdu FND are limited in size, domain-restricted, publicly unavailable, and not manually verified, and their news is translated from English into Urdu. In this paper, we curate and contribute the first and largest publicly available dataset for Urdu FND, "Ax-to-Grind Urdu", to bridge the identified gaps and limitations of existing Urdu datasets in the literature. It comprises 10,083 fake and real news items across fifteen domains, collected from leading and authentic Urdu newspapers and news channel websites in Pakistan and India. The dataset contains news items in Urdu from 2017 to 2023. The selected models were originally trained on large multilingual corpora. The proposed model is evaluated on F1-score, accuracy, precision, recall and MCC. An F1-score of 0.924, accuracy of 0.956, precision of 0.942, recall of 0.940 and MCC of 0.902 demonstrate the effectiveness of the proposed approach for Urdu FND. Comparative analysis with SOTA ML and DL models and existing Urdu benchmark datasets shows that the ensemble model outperforms them for Urdu FND.
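The five metrics named in the abstract have standard definitions for a binary fake/real classifier. The following is a minimal sketch of how they are commonly computed from confusion-matrix counts. The function name `binary_metrics` and the counts passed to it are illustrative assumptions, not figures from the paper, and the paper may use a different averaging scheme (for example macro or weighted averaging across classes).

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Compute common binary classification metrics from confusion counts.

    tp/fp/fn/tn are hypothetical counts with "fake" as the positive class.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against empty denominators so degenerate inputs return 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    # Matthews correlation coefficient (MCC).
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }


# Illustrative counts only; these are not results from the dataset.
metrics = binary_metrics(tp=90, fp=10, fn=5, tn=95)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it remains informative when the fake and real classes are imbalanced.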

dataset, ensemble model, urdu fnd

Country:
- North America > United States
  - California > Orange County > Irvine (0.04)
- Europe
  - Ukraine (0.04)
  - Russia (0.04)
  - Norway > Eastern Norway
    - Oslo (0.04)
- Asia
  - Pakistan (0.26)
  - India (0.25)
  - Russia (0.04)
  - Singapore (0.04)
  - China > Hubei Province
    - Wuhan (0.04)

Genre:
- Research Report (1.00)

Industry:
- Media > News (1.00)

Technology:
- Information Technology
  - Communications (1.00)
  - Artificial Intelligence
    - Natural Language
      - Text Processing (1.00)
      - Information Retrieval (0.70)
    - Machine Learning
      - Statistical Learning (1.00)
      - Neural Networks > Deep Learning (0.70)
