Cross-lingual Offensive Language Detection: A Systematic Review of Datasets, Transfer Approaches and Challenges

Jan-17-2024–arXiv.org Artificial Intelligence

The growing prevalence and rapid evolution of offensive language on social media make detection increasingly complex, and identifying such content across diverse languages is especially challenging. This survey presents a systematic and comprehensive exploration of Cross-Lingual Transfer Learning (CLTL) techniques for offensive language detection in social media. Our study is the first holistic overview to focus exclusively on the cross-lingual scenario in this domain. We analyse 67 relevant papers and categorise these studies across various dimensions, including the characteristics of the multilingual datasets used, the cross-lingual resources employed, and the specific CLTL strategies implemented. Based on "what to transfer", we also summarise three main CLTL approaches: instance, feature, and parameter transfer. Additionally, we shed light on the current challenges and future research opportunities in this field. Finally, we have made our survey resources available online, including two comprehensive tables that provide accessible references to the multilingual datasets and CLTL methods used in the reviewed literature.

Country:
- South America > Brazil
  - São Paulo (0.04)
  - Rio Grande do Sul > Porto Alegre (0.04)
- Oceania > Australia
  - Victoria > Melbourne (0.04)
- North America
  - Dominican Republic (0.04)
  - United States
    - Washington > King County
      - Seattle (0.04)
    - New York
      - New York County > New York City (0.04)
      - Rensselaer County > Troy (0.04)
    - Minnesota > Hennepin County
      - Minneapolis (0.14)
    - Louisiana > Orleans Parish
      - New Orleans (0.04)
    - California > San Diego County
      - San Diego (0.04)
  - Canada
    - Quebec > Montreal (0.04)
    - Ontario > Toronto (0.04)
    - British Columbia > Metro Vancouver Regional District
      - Vancouver (0.04)
- Europe
  - Estonia (0.04)
  - Italy > Tuscany
    - Florence (0.04)
  - Germany
    - Berlin (0.04)
    - Bavaria > Upper Bavaria
      - Munich (0.04)
  - Spain
    - Aragón (0.04)
    - Catalonia > Barcelona Province
      - Barcelona (0.04)
  - United Kingdom > England
    - Greater London > London (0.04)
  - France
    - Île-de-France > Paris
      - Paris (0.04)
    - Provence-Alpes-Côte d'Azur > Bouches-du-Rhône
      - Marseille (0.04)
    - Occitanie > Haute-Garonne
      - Toulouse (0.04)
  - Ukraine > Kyiv Oblast
    - Kyiv (0.04)
  - Croatia > Dubrovnik-Neretva County
    - Dubrovnik (0.04)
  - Ireland > Leinster
    - County Dublin > Dublin (0.04)
  - Belgium > Brussels-Capital Region
    - Brussels (0.04)
- Asia
  - China > Hong Kong (0.04)
  - Middle East > UAE
    - Abu Dhabi Emirate > Abu Dhabi (0.04)
  - Japan > Kyūshū & Okinawa
    - Kyūshū > Miyazaki Prefecture > Miyazaki (0.04)
  - India
    - West Bengal > Kolkata (0.04)
    - Telangana > Hyderabad (0.04)

Genre:
- Research Report (1.00)
- Overview (1.00)

Industry:
- Information Technology > Security & Privacy (1.00)
- Media > News (0.92)
- Law Enforcement & Public Safety > Crime Prevention & Enforcement (0.68)
- Law (0.67)
- Government (0.67)
- Education (0.67)

Technology:
- Information Technology
  - Data Science > Data Mining (1.00)
  - Communications > Social Media (1.00)
  - Artificial Intelligence
    - Representation & Reasoning (1.00)
    - Natural Language
      - Text Processing (1.00)
      - Machine Translation (1.00)
      - Large Language Model (1.00)
    - Machine Learning
      - Neural Networks > Deep Learning (1.00)
      - Statistical Learning (0.92)
      - Inductive Learning (0.67)
      - Transfer Learning (0.66)