Synthetic vs. Gold: The Role of LLM-Generated Labels and Data in Cyberbullying Detection

Kazemi, Arefeh, Kalaivendan, Sri Balaaji Natarajan, Wagner, Joachim, Qadeer, Hamza, Davis, Brian

Feb-21-2025–arXiv.org Artificial Intelligence

This study investigates the role of LLM-generated synthetic data in cyberbullying detection. We conduct a series of experiments in which we replace some or all of the authentic data with synthetic data, or augment the authentic data with synthetic data. We find that a classifier for harm detection trained on synthetic cyberbullying data reaches performance close to that of a classifier trained on authentic data. For all three LLMs tried, combining authentic with synthetic data improves test-set performance over the baseline of training on authentic data alone. These results highlight the viability of synthetic data as a scalable, ethically sound alternative in cyberbullying detection, while emphasizing the critical impact of LLM selection on performance outcomes.
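
The three training conditions named in the abstract (authentic only, partial or full replacement, and augmentation) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `build_training_set`, the `mode` strings, the `replace_fraction` parameter, and the `(text, label)` example format are all assumptions introduced here for illustration.

```python
import random
from typing import List, Tuple

# Hypothetical example format: (text, label), where 1 = harmful, 0 = not harmful.
Example = Tuple[str, int]


def build_training_set(
    authentic: List[Example],
    synthetic: List[Example],
    mode: str,
    replace_fraction: float = 1.0,
    seed: int = 0,
) -> List[Example]:
    """Assemble a training set under one of three assumed conditions.

    mode="authentic": baseline, authentic data only.
    mode="replace":   swap a fraction of the authentic examples for
                      synthetic ones, keeping the total size fixed.
    mode="augment":   keep all authentic data and add all synthetic data.
    """
    rng = random.Random(seed)  # local generator so results are reproducible

    if mode == "authentic":
        mixed = list(authentic)
    elif mode == "replace":
        if not 0.0 <= replace_fraction <= 1.0:
            raise ValueError("replace_fraction must be in [0, 1]")
        n_swap = round(len(authentic) * replace_fraction)
        if n_swap > len(synthetic):
            raise ValueError("not enough synthetic examples to swap in")
        kept = rng.sample(authentic, len(authentic) - n_swap)
        mixed = kept + rng.sample(synthetic, n_swap)
    elif mode == "augment":
        mixed = list(authentic) + list(synthetic)
    else:
        raise ValueError(f"unknown mode: {mode!r}")

    rng.shuffle(mixed)  # avoid ordering effects between the two sources
    return mixed


if __name__ == "__main__":
    # Toy placeholder data, not taken from the paper.
    authentic = [(f"authentic message {i}", i % 2) for i in range(10)]
    synthetic = [(f"synthetic message {i}", i % 2) for i in range(10)]
    half = build_training_set(authentic, synthetic, "replace", replace_fraction=0.5)
    print(len(half), sum(text.startswith("synthetic") for text, _ in half))
```

Keeping the total size fixed in the replacement condition is a design choice of this sketch, so that any change in classifier performance can be attributed to the data source rather than to the amount of training data. Whether the paper controls size this way is not stated in the abstract.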

Country:
- North America
  - United States (0.04)
  - Dominican Republic (0.04)
- Europe
  - Ireland > Leinster
    - County Dublin > Dublin (0.04)
  - Belgium > Brussels-Capital Region
    - Brussels (0.04)
- Asia
  - Singapore (0.04)
  - Thailand > Bangkok
    - Bangkok (0.04)

Genre:
- Research Report > New Finding (1.00)

Industry:
- Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
- Information Technology > Security & Privacy (1.00)

Technology:
- Information Technology
  - Communications > Social Media (1.00)
  - Artificial Intelligence
    - Natural Language > Large Language Model (1.00)
    - Machine Learning
      - Neural Networks > Deep Learning (0.46)
      - Learning Graphical Models > Directed Networks > Bayesian Learning (0.46)