Jin, Mali
Hostility Detection in UK Politics: A Dataset on Online Abuse Targeting MPs
Pandya, Mugdha, Jin, Mali, Bontcheva, Kalina, Maynard, Diana
Numerous politicians use social media platforms, particularly X, to engage with their constituents. This interaction allows constituents to pose questions and offer feedback but also exposes politicians to a barrage of hostile responses, especially given the anonymity afforded by social media. They are typically targeted in relation to their governmental role, but the comments also tend to attack their personal identity. This can discredit politicians and reduce public trust in the government. It can also incite anger and disrespect, leading to offline harm and violence. While many models exist for detecting hostility in general, they lack the specificity required for political contexts. Furthermore, addressing hostility towards politicians demands tailored approaches due to the distinct language and issues inherent to each country (e.g., Brexit for the UK). To bridge this gap, we construct a dataset of 3,320 English tweets spanning a two-year period, manually annotated for hostility towards UK MPs. Our dataset also captures the targeted identity characteristics (race, gender, religion, none) in hostile tweets. We perform linguistic and topical analyses to examine the unique content of the UK political data. Finally, we evaluate the performance of pre-trained language models and large language models on binary hostility detection and multi-class targeted identity type classification tasks. Our study offers valuable data and insights for future research on the prevalence and nature of politics-related hostility specific to the UK.
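A minimal sketch of how such an evaluation might be set up, assuming the Hugging Face transformers library, a generic roberta-base checkpoint, and two placeholder tweets standing in for the annotated data; this is an illustrative baseline, not the authors' exact models or hyperparameters:

```python
# Sketch: fine-tuning a pre-trained language model for binary hostility
# detection. Placeholder data and hyperparameters; not the paper's setup.
import torch
from torch.utils.data import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Hypothetical examples standing in for the annotated MP-related tweets.
tweets = ["You are a disgrace to your constituents",
          "Thanks for the update on the bill"]
labels = [1, 0]  # 1 = hostile, 0 = non-hostile

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
# For the multi-class targeted-identity task, the same pipeline would use
# num_labels=4 (race, gender, religion, none).
model = AutoModelForSequenceClassification.from_pretrained("roberta-base",
                                                           num_labels=2)

class TweetDataset(Dataset):
    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, padding=True, max_length=128)
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="hostility-model", num_train_epochs=3,
                           per_device_train_batch_size=16, learning_rate=2e-5),
    train_dataset=TweetDataset(tweets, labels),
)
trainer.train()
```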
Enhancing Data Quality through Simple De-duplication: Navigating Responsible Computational Social Science Research
Mu, Yida, Jin, Mali, Song, Xingyi, Aletras, Nikolaos
Research in natural language processing (NLP) for Computational Social Science (CSS) heavily relies on data from social media platforms. This data plays a crucial role in the development of models for analysing socio-linguistic phenomena within online communities. In this work, we conduct an in-depth examination of data quality in 20 datasets extensively used in NLP for CSS. Our analysis reveals that social media datasets exhibit varying levels of data duplication. Consequently, this gives rise to challenges like label inconsistencies and data leakage, compromising the reliability of models. Our findings also suggest that data duplication has an impact on the current claims of state-of-the-art performance, potentially leading to an overestimation of model effectiveness in real-world scenarios. Finally, we propose new protocols and best practices for developing and using datasets built from social media data.
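To make this kind of audit concrete, the sketch below checks a dataset for exact duplicates, conflicting labels on duplicated texts, and train/test leakage; the pandas column names ("text", "label", "split") and the toy rows are assumptions for illustration, not the actual datasets examined in the paper:

```python
# Sketch of a simple de-duplication audit for a social media dataset.
# Column names ("text", "label", "split") are hypothetical.
import pandas as pd

def normalise(text: str) -> str:
    # Light normalisation so trivially different copies match exactly.
    return " ".join(text.lower().split())

def audit_duplicates(df: pd.DataFrame) -> None:
    df = df.assign(norm=df["text"].map(normalise))

    # 1. Exact duplicates within the dataset.
    dup_mask = df.duplicated("norm", keep=False)
    print(f"Duplicated texts: {dup_mask.sum()} of {len(df)}")

    # 2. Label inconsistencies: identical texts annotated with different labels.
    inconsistent = (df[dup_mask].groupby("norm")["label"].nunique() > 1).sum()
    print(f"Duplicate groups with conflicting labels: {inconsistent}")

    # 3. Data leakage: texts that appear in both the train and test splits.
    train = set(df.loc[df["split"] == "train", "norm"])
    test = set(df.loc[df["split"] == "test", "norm"])
    print(f"Texts shared between train and test: {len(train & test)}")

# Toy example:
audit_duplicates(pd.DataFrame({
    "text": ["Great point!", "great  point!", "Totally disagree"],
    "label": [0, 1, 0],
    "split": ["train", "test", "train"],
}))
```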
Dimensions of Online Conflict: Towards Modeling Agonism
Canute, Matt, Jin, Mali, holtzclaw, hannah, Lusoli, Alberto, Adams, Philippa R, Pandya, Mugdha, Taboada, Maite, Maynard, Diana, Chun, Wendy Hui Kyong
Agonism plays a vital role in democratic dialogue by fostering diverse perspectives and robust discussions. Within the realm of online conflict there is another type: hateful antagonism, which undermines constructive dialogue. Detecting conflict online is central to platform moderation and monetization. It is also vital for democratic dialogue, but only when it takes the form of agonism. To model these two types of conflict, we collected Twitter conversations related to trending controversial topics. We introduce a comprehensive annotation schema for labelling different dimensions of conflict in the conversations, such as the source of conflict, the target, and the rhetorical strategies deployed. Using this schema, we annotated approximately 4,000 conversations with multiple labels. We then trained both logistic regression and transformer-based models on the dataset, incorporating context from the conversation, including the number of participants and the structure of the interactions. Results show that contextual labels are helpful in identifying conflict and make the models robust to variations in topic. Our research contributes a conceptualization of different dimensions of conflict, a richly annotated dataset, and promising results that can contribute to content moderation.
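As an illustration of combining tweet text with conversation-level context, the following sketch feeds TF-IDF features and two hypothetical context features (number of participants, reply depth) into a logistic-regression classifier; it is a simplified stand-in under assumed feature names, not the paper's implementation:

```python
# Sketch: a logistic-regression conflict classifier that combines tweet text
# with conversation-level context features. Feature names and rows are made up.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy conversation rows standing in for the annotated Twitter data.
data = pd.DataFrame({
    "text": ["This policy is a betrayal", "Interesting thread, thanks",
             "You people never listen", "Fair point, I had not considered that"],
    "n_participants": [12, 3, 25, 2],   # conversation size
    "reply_depth": [4, 1, 6, 1],        # position in the reply tree
    "conflict": [1, 0, 1, 0],
})

model = Pipeline([
    ("features", ColumnTransformer([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2)), "text"),
        ("context", "passthrough", ["n_participants", "reply_depth"]),
    ])),
    ("clf", LogisticRegression(max_iter=1000)),
])

features = data[["text", "n_participants", "reply_depth"]]
model.fit(features, data["conflict"])
print(model.predict(features))
```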
Examining Temporal Bias in Abusive Language Detection
Jin, Mali, Mu, Yida, Maynard, Diana, Bontcheva, Kalina
In recent years, researchers have developed a huge variety of machine learning models that can automatically detect abusive language (Mishra et al. 2019; Aurpa, Sadik, and Ahmed 2022; Das and Mukherjee 2023; Alrashidi, Jamal, and Alkhathlan 2023). However, these models may be subject to temporal bias, which can lead to a decrease in the accuracy of abusive language detection models, potentially allowing abusive language to be undetected or falsely detected. Temporal bias arises from differences in populations and ...

Previous work identified temporal bias in an Italian hate speech data set associated with immigrants (Florio et al. 2020). However, they have yet to explore temporal factors affecting predictive performance from a multilingual perspective. In this paper, we explore temporal bias in 5 different abusive data sets that span varying time periods, in 4 languages (English, Spanish, Italian, and Chinese). Specifically, we investigate the following core research questions: RQ1: How does the magnitude of temporal bias vary across different data sets such as language, time span and collection methods?
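One simple way to quantify the magnitude of temporal bias, sketched below, is the drop in macro-F1 between a random split and a chronological split of the same dataset; the TF-IDF plus logistic-regression classifier and the column names ("text", "label", "timestamp") are illustrative assumptions, not the models or schemas used in the paper:

```python
# Sketch: temporal bias measured as the performance gap between a random
# split and a chronological split of the same abusive-language dataset.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

def macro_f1(train: pd.DataFrame, test: pd.DataFrame) -> float:
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    clf.fit(train["text"], train["label"])
    return f1_score(test["label"], clf.predict(test["text"]), average="macro")

def temporal_bias(df: pd.DataFrame, test_size: float = 0.2) -> float:
    """Difference in macro-F1 between a random and a chronological split."""
    rand_train, rand_test = train_test_split(df, test_size=test_size,
                                             random_state=0)

    ordered = df.sort_values("timestamp")          # oldest posts first
    cut = int(len(ordered) * (1 - test_size))
    chron_train, chron_test = ordered.iloc[:cut], ordered.iloc[cut:]

    return macro_f1(rand_train, rand_test) - macro_f1(chron_train, chron_test)

# Applied to each dataset (columns assumed: "text", "label", "timestamp"),
# a larger gap indicates stronger temporal bias.
```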
Examining Temporalities on Stance Detection towards COVID-19 Vaccination
Mu, Yida, Jin, Mali, Bontcheva, Kalina, Song, Xingyi
Previous studies have highlighted the importance of vaccination as an effective strategy to control the transmission of the COVID-19 virus. It is crucial for policymakers to have a comprehensive understanding of the public's stance towards vaccination on a large scale. However, attitudes towards COVID-19 vaccination, such as pro-vaccine or vaccine hesitancy, have evolved over time on social media. Thus, it is necessary to account for possible temporal shifts when analysing these stances. This study aims to examine the impact of temporal concept drift on stance detection towards COVID-19 vaccination on Twitter. To this end, we evaluate a range of transformer-based models on social media data using chronological splits (training, validation, and test sets ordered by time) and random splits (the three sets sampled at random). Our findings demonstrate significant discrepancies in model performance when comparing random and chronological splits across all monolingual and multilingual datasets. Chronological splits significantly reduce the accuracy of stance classification. Therefore, real-world stance detection approaches need to be further refined to incorporate temporal factors as a key consideration.
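The two splitting schemes can be made concrete with a short sketch; the timestamp column name ("created_at") and the split proportions are assumptions for illustration, not necessarily those used in the study:

```python
# Sketch of the two data-splitting schemes compared in the study:
# a chronological split (train/validation/test ordered by time) and a
# random split of the same three sets. Column name "created_at" is assumed.
import pandas as pd

def chronological_split(df: pd.DataFrame, frac=(0.7, 0.1, 0.2)):
    """Oldest tweets for training, most recent for testing."""
    ordered = df.sort_values("created_at").reset_index(drop=True)
    n = len(ordered)
    i, j = int(n * frac[0]), int(n * (frac[0] + frac[1]))
    return ordered.iloc[:i], ordered.iloc[i:j], ordered.iloc[j:]

def random_split(df: pd.DataFrame, frac=(0.7, 0.1, 0.2), seed=0):
    """Same proportions, but tweets assigned to the sets at random."""
    shuffled = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    n = len(shuffled)
    i, j = int(n * frac[0]), int(n * (frac[0] + frac[1]))
    return shuffled.iloc[:i], shuffled.iloc[i:j], shuffled.iloc[j:]

# The same stance classifier is then trained and evaluated under both schemes;
# a performance drop under the chronological split indicates temporal drift.
```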
VaxxHesitancy: A Dataset for Studying Hesitancy towards COVID-19 Vaccination on Twitter
Mu, Yida, Jin, Mali, Grimshaw, Charlie, Scarton, Carolina, Bontcheva, Kalina, Song, Xingyi
Vaccine hesitancy has been a common concern, probably since vaccines were created; with the popularisation of social media, people began expressing their concerns about vaccines online, alongside those posting pro- and anti-vaccine content. Predictably, since the first mentions of a COVID-19 vaccine, social media users posted about their fears and concerns, or about their support and belief in the effectiveness of these rapidly developing vaccines. Identifying and understanding the reasons behind public hesitancy towards COVID-19 vaccines is important for policymakers who need to develop actions to better inform the population, with the aim of increasing vaccine take-up. In the case of COVID-19, where the fast development of the vaccines was mirrored closely by the growth of anti-vaxx disinformation, automatic means of detecting citizen attitudes towards vaccination became necessary. This is an important computational social science task that requires data analysis to gain an in-depth understanding of the phenomena at hand. Annotated data is also necessary for training data-driven models for more nuanced analysis of attitudes towards vaccination. To this end, we created a new collection of over 3,101 tweets annotated with users' attitudes towards COVID-19 vaccination (stance). In addition, we developed a domain-specific language model (VaxxBERT) that achieves the best predictive performance (73.0 accuracy and 69.3 F1-score) compared to a robust set of baselines. To the best of our knowledge, these are the first dataset and model to treat vaccine hesitancy as a category distinct from pro- and anti-vaccine stance.
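Although the abstract does not spell out how VaxxBERT was built, a domain-specific model of this kind is typically obtained by continuing masked-language-model pretraining on in-domain tweets before fine-tuning for classification; the sketch below illustrates that generic recipe with the Hugging Face Trainer, using an illustrative bert-base-uncased checkpoint and two made-up tweets, not the paper's exact procedure:

```python
# Sketch: domain-adaptive masked-language-model pretraining on vaccine tweets,
# one common way to obtain a domain-specific encoder. Checkpoint, corpus and
# hyperparameters are illustrative assumptions.
import torch
from torch.utils.data import Dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Hypothetical unlabelled vaccine-related tweets for continued pretraining.
corpus = ["Got my second dose today, arm is sore but fine",
          "Not sure the trials were long enough to trust this vaccine"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

class TweetCorpus(Dataset):
    def __init__(self, texts):
        self.enc = tokenizer(texts, truncation=True, padding=True, max_length=128)

    def __len__(self):
        return len(self.enc["input_ids"])

    def __getitem__(self, i):
        return {k: torch.tensor(v[i]) for k, v in self.enc.items()}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="vaxx-mlm", num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=TweetCorpus(corpus),
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()
# The adapted encoder would then be fine-tuned for the three-way
# pro-vaccine / anti-vaccine / hesitant stance classification task.
```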