Withtheadventofsocialmedia,therehasbeenanextremely rapid increase in the content shared online. Consequently, the propagation of fake news and hostile messages on social media platforms has also skyrocketed. In this paper, we address the problem of detecting hostile and fake content in the Devanagari (Hindi) script as a multi-class, multi-label problem. Using NLP techniques, we build a model that makes use of an abusive language detector coupled with features extracted via Hindi BERT and Hindi FastText models and metadata. Our model achieves a 0.97 F1 score on coarse grain evaluation on Hostility detection task. Additionally, we built models to identify fake news related to Covid-19 in English tweets. We leverage entity information extracted from the tweets along with textual representations learned from word embeddings and achieve a 0.93 F1 score on the English fake news detection task.
Hate speech is a major issue in social networks due to the high volume of data generated daily. Recent works demonstrate the usefulness of machine learning (ML) in dealing with the nuances required to distinguish between hateful posts from just sarcasm or offensive language. Many ML solutions for hate speech detection have been proposed by either changing how features are extracted from the text or the classification algorithm employed. However, most works consider only one type of feature extraction and classification algorithm. This work argues that a combination of multiple feature extraction techniques and different classification models is needed. We propose a framework to analyze the relationship between multiple feature extraction and classification techniques to understand how they complement each other. The framework is used to select a subset of complementary techniques to compose a robust multiple classifiers system (MCS) for hate speech detection. The experimental study considering four hate speech classification datasets demonstrates that the proposed framework is a promising methodology for analyzing and designing high-performing MCS for this task. MCS system obtained using the proposed framework significantly outperforms the combination of all models and the homogeneous and heterogeneous selection heuristics, demonstrating the importance of having a proper selection scheme. Source code, figures, and dataset splits can be found in the GitHub repository: https://github.com/Menelau/Hate-Speech-MCS.
The scope of a lucrative career promoted by Google through its video distribution platform YouTube has attracted a large number of users to become content creators. An important aspect of this line of work is the feedback received in the form of comments which show how well the content is being received by the audience. However, volume of comments coupled with spam and limited tools for comment classification makes it virtually impossible for a creator to go through each and every comment and gather constructive feedback. Automatic classification of comments is a challenge even for established classification models, since comments are often of variable lengths riddled with slang, symbols and abbreviations. This is a greater challenge where comments are multilingual as the messages are often rife with the respective vernacular. In this work, we have evaluated top-performing classification models for classifying comments which are a mix of different combinations of English and Malayalam (only English, only Malayalam and Mix of English and Malayalam). The statistical analysis of results indicates that Multinomial Naive Bayes, K-Nearest Neighbors (KNN), Support Vector Machine (SVM), Random Forest and Decision Trees offer similar level of accuracy in comment classification. Further, we have also evaluated 3 multilingual transformer based language models (BERT, DISTILBERT and XLM) and compared their performance to the traditional machine learning classification techniques. XLM was the top-performing BERT model with an accuracy of 67.31. Random Forest with Term Frequency Vectorizer was the best performing model out of all the traditional classification models with an accuracy of 63.59.
An increase in the use of smartphones has laid to the use of the internet and social media platforms. The most commonly used social media platforms are Twitter, Facebook, WhatsApp and Instagram. People are sharing their personal experiences, reviews, feedbacks on the web. The information which is available on the web is unstructured and enormous. Hence, there is a huge scope of research on understanding the sentiment of the data available on the web. Sentiment Analysis (SA) can be carried out on the reviews, feedbacks, discussions available on the web. There has been extensive research carried out on SA in the English language, but data on the web also contains different other languages which should be analyzed. This paper aims to analyze, review and discuss the approaches, algorithms, challenges faced by the researchers while carrying out the SA on Indigenous languages.
With the rise of social media like Twitter and of software distribution platforms like app stores, users got various ways to express their opinion about software products. Popular software vendors get user feedback thousandfold per day. Research has shown that such feedback contains valuable information for software development teams such as problem reports or feature and support inquires. Since the manual analysis of user feedback is cumbersome and hard to manage many researchers and tool vendors suggested to use automated analyses based on traditional supervised machine learning approaches. In this work, we compare the results of traditional machine learning and deep learning in classifying user feedback in English and Italian into problem reports, inquiries, and irrelevant. Our results show that using traditional machine learning, we can still achieve comparable results to deep learning, although we collected thousands of labels.