An Annotated Corpus of Arabic Tweets for Hate Speech Analysis

May-26-2025–arXiv.org Artificial Intelligence

Identifying hate speech content in the Arabic language is challenging due to the rich quality of dialectal variations. This study introduces a multilabel hate speech dataset in the Arabic language. We have collected 10000 Arabic tweets and annotated each tweet, whether it contains offensive content or not. If a text contains offensive content, we further classify it into different hate speech targets such as religion, gender, politics, ethnicity, origin, and others. A text can contain either single or multiple targets. Multiple annotators are involved in the data annotation task. We calculated the inter-annotator agreement, which was reported to be 0.86 for offensive content and 0.71 for multiple hate speech targets. Finally, we evaluated the data annotation task by employing a different transformers-based model in which AraBERTv2 outperformed with a micro-F1 score of 0.7865 and an accuracy of 0.786.

machine learning, natural language, tweet, (21 more...)

arXiv.org Artificial Intelligence

May-26-2025

arXiv.org PDF

Add feedback

Country:
- Europe (0.46)
- Asia > Middle East
  - Qatar (0.15)

Genre:
- Research Report (0.82)

Industry:
- Information Technology (1.00)

Technology:
- Information Technology
  - Communications > Social Media (1.00)
  - Artificial Intelligence
    - Natural Language (1.00)
    - Machine Learning > Neural Networks
      - Deep Learning (0.90)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found