BAN-PL: a Novel Polish Dataset of Banned Harmful and Offensive Content from Wykop.pl web service

Okulska, Inez, Głąbińska, Kinga, Kołos, Anna, Karlińska, Agnieszka, Wiśnios, Emilia, Nowakowski, Adam, Ellerik, Paweł, Prałat, Andrzej

Aug-23-2023–arXiv.org Artificial Intelligence

Advances in automated detection of offensive language online, including hate speech and cyberbullying, require improved access to publicly available datasets of social media content. In this paper, we introduce BAN-PL, the first open dataset in the Polish language that comprises texts flagged as harmful and subsequently removed by professional moderators. The dataset contains a total of 691,662 pieces of content, both posts and comments, from Wykop, a popular social networking service often referred to as the "Polish Reddit", and is evenly split between two classes: "harmful" and "neutral". We provide a comprehensive description of the data collection and preprocessing procedures and highlight the linguistic specificity of the data. The BAN-PL dataset, along with advanced preprocessing scripts for, among other things, unmasking profanities, will be publicly available.


