Empirical Analysis of Multi-Task Learning for Reducing Model Bias in Toxic Comment Detection

Ameya Vaidya (1), Feng Mai (2), Yue Ning (3)
(1) Bridgewater-Raritan Regional High School
(2) School of Business, Stevens Institute of Technology
(3) Department of Computer Science, Stevens Institute of Technology
email@example.com

Abstract

With the recent rise of toxicity in online conversations on social media platforms, using modern machine learning algorithms for toxic comment detection has become a central focus of many online applications. Researchers and companies have developed a variety of shallow and deep learning models to identify toxicity in online conversations, reviews, or comments with mixed successes. However, these existing approaches have learned to incorrectly associate nontoxic comments that have certain trigger-words (e.g. In this paper, we evaluate dozens of state-of-the-art models with the specific focus of reducing model bias towards these commonly-attacked identity groups. We propose a multi-task learning model with an attention layer that jointly learns to predict the toxicity of a comment as well as the identities present in the comments in order to reduce this bias. We then compare our model to an array of shallow and deep-learning models using metrics designed especially to test for unintended model bias within these identity groups.

Introduction

The identification of potential toxicity within online conversations has always been a significant task for current platform providers. Toxic comments have the unfortunate effect of causing users to leave a discussion or give up sharing their perspective and can give a bad reputation to platforms where these discussions take place. Twitter's CEO reaffirmed that Twitter is still being overrun by spam, abuse, and misinformation. Current research involves investigating common challenges in toxic comment classification (van Aken et al. 2018), identifying subtle forms of toxicity (Noever 2018), detecting early signs of toxicity (Zhang et al. 2018), and analysing sarcasm within conversations (Ghosh, Fabbri, and Muresan 2018).
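
The joint objective described in the abstract above, predicting a comment's toxicity together with the identities it mentions, can be illustrated with a small sketch. This is not the authors' implementation: the weighted-sum form of the loss, the `identity_weight` parameter, and the function names are all assumptions made for illustration, and the attention layer is omitted entirely.

```python
import math


def binary_cross_entropy(prob, label, eps=1e-7):
    """Binary cross-entropy for one predicted probability and a 0/1 label."""
    prob = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(prob) + (1.0 - label) * math.log(1.0 - prob))


def multitask_loss(tox_prob, tox_label, identity_probs, identity_labels,
                   identity_weight=0.5):
    """Toxicity loss plus a weighted auxiliary identity loss.

    ASSUMPTION: the weighted sum and `identity_weight` are illustrative;
    the source does not state how the two tasks are combined.
    """
    if len(identity_probs) != len(identity_labels):
        raise ValueError("identity_probs and identity_labels must align")

    tox_loss = binary_cross_entropy(tox_prob, tox_label)

    # Average the per-identity losses so the auxiliary term does not grow
    # with the number of identity groups being tracked.
    if identity_probs:
        identity_loss = sum(
            binary_cross_entropy(p, y)
            for p, y in zip(identity_probs, identity_labels)
        ) / len(identity_probs)
    else:
        identity_loss = 0.0

    return tox_loss + identity_weight * identity_loss
```

In a setup like this, a shared encoder would receive gradients from both terms, which is one plausible way for the model to learn that an identity mention alone is not evidence of toxicity.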
Two years ago, the Toxic Comment Classification Challenge was published on Kaggle. Discussing things you care about can be difficult. The threat of abuse and harassment online means that many people stop expressing themselves and give up on seeking different opinions. Platforms struggle to effectively facilitate conversations, leading many communities to limit or completely shut down user comments. In this post, we develop a tool that is able to recognize toxicity in comments.
The spectacular expansion of the Internet has led to the development of a new research problem in the natural language processing field: automatic toxic comment detection, since many countries prohibit hate speech in public media. There is no clear and formal definition of hate, offensive, toxic, and abusive speech. In this article, we put all these terms under the "umbrella" of toxic speech. The contribution of this paper is the design of binary classification and regression-based approaches aiming to predict whether a comment is toxic or not. We compare different unsupervised word representations and different DNN classifiers. Moreover, we study the robustness of the proposed approaches to adversarial attacks by adding one (healthy or toxic) word. We evaluate the proposed methodology on the English Wikipedia Detox corpus. Our experiments show that BERT fine-tuning outperforms feature-based BERT, Mikolov's word embeddings, and fastText representations with different DNN classifiers.
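
The one-word robustness check mentioned above can be sketched as follows. This is a toy illustration only: the keyword-fraction scorer, the tiny word list, the 0.5 threshold, and the choice to append the word at the end of the comment are all assumptions standing in for a trained DNN classifier and the paper's actual attack procedure.

```python
# ASSUMPTION: a tiny, hypothetical lexicon used only by the toy scorer below.
TOXIC_WORDS = {"idiot", "stupid"}


def toy_toxicity_score(comment):
    """Fraction of tokens found in the toxic lexicon (stand-in for a model)."""
    tokens = comment.lower().split()
    if not tokens:
        return 0.0
    return sum(token in TOXIC_WORDS for token in tokens) / len(tokens)


def one_word_attack(comment, word, score_fn=toy_toxicity_score, threshold=0.5):
    """Append one word and report whether the predicted label flips.

    Returns a tuple (label_before, label_after, flipped), where a label is
    True when the score reaches the threshold (comment predicted toxic).
    """
    label_before = score_fn(comment) >= threshold
    label_after = score_fn(comment + " " + word) >= threshold
    return label_before, label_after, label_before != label_after


if __name__ == "__main__":
    # Adding one healthy word dilutes the toy score below the threshold.
    print(one_word_attack("you idiot", "love"))
    # Adding one toxic word to a long healthy comment is not enough here.
    print(one_word_attack("have a nice day", "idiot"))
```

Passing a real classifier as `score_fn` would allow the same loop to measure how often a single added healthy or toxic word changes the prediction across a corpus.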
Adversarial examples are important for understanding the behavior of neural models, and can improve their robustness through adversarial training. Recent work in natural language processing generated adversarial examples by assuming white-box access to the attacked model, and optimizing the input directly against it (Ebrahimi et al., 2018). In this work, we show that the knowledge implicit in the optimization procedure can be distilled into another more efficient neural network. We train a model to emulate the behavior of a white-box attack and show that it generalizes well across examples. Moreover, it reduces adversarial example generation time by 19x-39x. We also show that our approach transfers to a black-box setting, by attacking the Google Perspective API and exposing its vulnerability. Our attack flips the API-predicted label in 42% of the generated examples, while humans maintain high accuracy in predicting the gold label.
To identify and classify toxic online commentary, the modern tools of data science transform raw text into key features from which either thresholding or learning algorithms can make predictions for monitoring offensive conversations. We systematically evaluate 62 classifiers representing 19 major algorithmic families against features extracted from the Jigsaw dataset of Wikipedia comments. We compare the classifiers based on statistically significant differences in accuracy and relative execution time. Among these classifiers for identifying toxic comments, tree-based algorithms provide the most transparently explainable rules and rank-order the predictive contribution of each feature. Among 28 features of syntax, sentiment, emotion and outlier word dictionaries, a simple bad word list proves most predictive of offensive commentary.

Introduction

In 2015, the Twitter CEO, Dick Costolo, took personal responsibility for online harassment, trolling and abuse on the Twitter platform.