Automatic Textual Normalization for Hate Speech Detection
Anh Thi-Hoang Nguyen, Dung Ha Nguyen, Nguyet Thi Nguyen, Khanh Thanh-Duy Ho, Kiet Van Nguyen
arXiv.org Artificial Intelligence
Social media data is a valuable resource for research, yet it contains a wide range of non-standard words (NSW). These irregularities hinder the effective operation of NLP tools. Current state-of-the-art methods for Vietnamese treat this issue as lexical normalization, either crafting manual rules or building multi-stage deep learning frameworks, both of which demand substantial engineering effort. In contrast, our approach is straightforward, employing a single sequence-to-sequence (Seq2Seq) model. In this research, we provide a dataset for textual normalization comprising 2,181 human-annotated comments with an inter-annotator agreement of 0.9014. Applying the Seq2Seq model to textual normalization, we obtain an accuracy just short of 70%. Nevertheless, textual normalization improves the accuracy of the Hate Speech Detection (HSD) task by approximately 2%, demonstrating its potential to boost the performance of complex NLP tasks. Our dataset is accessible for research purposes.
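The abstract reports a normalization accuracy just short of 70%. As an illustration of how such a score might be computed, here is a minimal sketch assuming sentence-level exact-match accuracy, a common metric for lexical normalization; the paper's exact metric and examples may differ, and the Vietnamese samples below are hypothetical.

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predicted normalizations that exactly match the gold text."""
    assert len(predictions) == len(references), "mismatched number of examples"
    matches = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return matches / len(references)

# Hypothetical examples: model output vs. human-annotated normalization.
# "ko bít" is a non-standard spelling of "không biết" ("don't know").
preds = ["không biết", "được rồi", "ko bít"]
golds = ["không biết", "được rồi", "không biết"]
print(exact_match_accuracy(preds, golds))  # 2 of 3 predictions match
```

A Seq2Seq normalizer would produce `preds` by decoding each noisy comment into its standard form; the metric above then compares those outputs against the human annotations.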
Dec-4-2023