reddit comment
Analyzing User Perceptions of Large Language Models (LLMs) on Reddit: Sentiment and Topic Modeling of ChatGPT and DeepSeek Discussions
Although discourse on large language models (LLMs) such as ChatGPT and DeepSeek is growing, there is no comprehensive understanding of how users of online platforms such as Reddit perceive these models. This is an important omission because public opinion can influence AI development, trust, and future policy. This study analyzes Reddit discussions about ChatGPT and DeepSeek using sentiment analysis and topic modeling to advance the understanding of user attitudes. It explores significant topics such as trust in AI, user expectations, potential uses of the tools, reservations about AI biases, and the ethical implications of their use. By examining these concerns, the study provides a sense of how public sentiment might shape the direction of AI development going forward. It also reports whether users have faith in the technology and what they see as its future. A word frequency approach is used to identify broad topics and sentiment trends. In addition, topic modeling with Latent Dirichlet Allocation (LDA) identifies the top topics in users' language, for example the potential benefits of LLMs, their technological applications, and their overall social ramifications. The study aims to inform developers and policymakers by making it easier to see how users comprehend and experience these game-changing technologies.
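The word frequency step mentioned in this abstract could look roughly like the sketch below. It is not the authors' pipeline: the stopword list, the two sentiment lexicons, and the example comments are all hypothetical placeholders, and the LDA step is not shown.

```python
import re
from collections import Counter

# Hypothetical stopword list and sentiment lexicons, for illustration only.
STOPWORDS = {"the", "a", "is", "it", "and", "but", "i", "to", "of", "for", "my"}
POSITIVE = {"great", "helpful", "trust", "love"}
NEGATIVE = {"biased", "wrong", "worse", "distrust"}


def tokenize(text):
    """Lowercase a comment and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(comments):
    """Count non-stopword tokens across all comments."""
    counts = Counter()
    for comment in comments:
        counts.update(t for t in tokenize(comment) if t not in STOPWORDS)
    return counts


def sentiment_score(comment):
    """Positive minus negative lexicon hits; a value above 0 leans positive."""
    tokens = tokenize(comment)
    return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)


# Made-up comments standing in for scraped Reddit data.
comments = [
    "ChatGPT is great for coding and I trust it",
    "DeepSeek answers feel biased but helpful",
    "The coding answers were wrong",
]
freqs = word_frequencies(comments)
scores = [sentiment_score(c) for c in comments]
```

Calling `freqs.most_common(10)` would then list the most frequent terms, which is one simple way to surface broad topics before running a topic model.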
Social media polarization during conflict: Insights from an ideological stance dataset on Israel-Palestine Reddit comments
Ali, Hasin Jawad, Abrar, Ajwad, Hossain, S. M. Hozaifa, Mridha, M. Firoz
In politically sensitive scenarios like wars, social media serves as a platform for polarized discourse and expressions of strong ideological stances. While prior studies have explored ideological stance detection in general contexts, limited attention has been given to conflict-specific settings. This study addresses this gap by analyzing 9,969 Reddit comments related to the Israel-Palestine conflict, collected between October 2023 and August 2024. The comments were categorized into three stance classes: Pro-Israel, Pro-Palestine, and Neutral. Various approaches, including machine learning, pre-trained language models, neural networks, and prompt engineering strategies for open source large language models (LLMs), were employed to classify these stances. Performance was assessed using metrics such as accuracy, precision, recall, and F1-score. Among the tested methods, the Scoring and Reflective Re-read prompt in Mixtral 8x7B demonstrated the highest performance across all metrics. This study provides comparative insights into the effectiveness of different models for detecting ideological stances in highly polarized social media contexts. The dataset used in this research is publicly available for further exploration and validation.
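The four metrics named in this abstract can be computed for a three-class stance task as in the sketch below. Macro averaging over the classes is an assumption made here; the abstract does not say which averaging scheme was used, and the labels and predictions are invented toy values.

```python
LABELS = ["Pro-Israel", "Pro-Palestine", "Neutral"]


def per_class_scores(y_true, y_pred, label):
    """Return (precision, recall, F1) for one class, treated one-vs-rest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def evaluate(y_true, y_pred, labels=LABELS):
    """Accuracy plus macro-averaged precision, recall, and F1 (assumed scheme)."""
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    scores = [per_class_scores(y_true, y_pred, lab) for lab in labels]
    n = len(labels)
    return {
        "accuracy": accuracy,
        "macro_precision": sum(s[0] for s in scores) / n,
        "macro_recall": sum(s[1] for s in scores) / n,
        "macro_f1": sum(s[2] for s in scores) / n,
    }


# Toy gold labels and predictions, not taken from the dataset.
y_true = ["Pro-Israel", "Pro-Israel", "Pro-Palestine",
          "Pro-Palestine", "Neutral", "Neutral"]
y_pred = ["Pro-Israel", "Pro-Palestine", "Pro-Palestine",
          "Pro-Palestine", "Neutral", "Pro-Israel"]
report = evaluate(y_true, y_pred)
```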
Song Emotion Classification of Lyrics with Out-of-Domain Data under Label Scarcity
Sakunkoo, Jonathan, Sakunkoo, Annabella
Songs have been found to profoundly impact human emotions, with lyrics having significant power to stimulate emotional changes in the audience. There is a scarcity of large, high quality in-domain datasets for lyrics-based song emotion classification (Edmonds and Sedoc, 2021; Zhou, 2022). It has been noted that in-domain training datasets are often difficult to acquire (Zhang and Miao, 2023) and that label acquisition is often limited by cost, time, and other factors (Azad et al., 2018). We examine the novel usage of a large out-of-domain dataset as a creative solution to the challenge of training data scarcity in the emotional classification of song lyrics. We find that CNN models trained on a large Reddit comments dataset achieve satisfactory performance and generalizability to lyrical emotion classification, thus giving insights into and a promising possibility in leveraging large, publicly available out-of-domain datasets for domains whose in-domain data are lacking or costly to acquire.
Analysing the Public Discourse around OpenAI's Text-To-Video Model 'Sora' using Topic Modeling
Announced on February 15, 2024, OpenAI's text-to-video model Sora instantly caught the public's attention by demonstrating the ability to generate dynamic and realistic video clips from text prompts, similar to how OpenAI's DALL-E generates images from text. While Sora is still in a pre-release phase, its potential to revolutionize content creation and disrupt various industries, be it media, entertainment, or advertising, has already ignited discussions across online communities. Subreddits such as r/OpenAI, r/technology and r/ChatGPT have emerged as epicentres for technology enthusiasts and critics to openly discuss and share narratives about the latest advancements in AI technologies. Previous studies have explored public perceptions of large language models like ChatGPT and image generators such as DALL-E by analysing online forums. For instance, Talafidaryani and Mora (2024) employed topic modeling techniques on Reddit data to uncover dominant themes surrounding ChatGPT, including its capabilities, limitations, and ethical considerations. Similarly, Zhou and Nabus (2023) investigated discussions on DALL-E, revealing discourse on creative applications, risks of misuse, and comparisons to human artists. However, due to Sora's relatively recent emergence, there is still a lack of research on the narratives and themes emerging from Reddit conversations about this novel technology. By conducting topic modeling analysis on a large corpus of Reddit comments, the study aims to fill that gap and uncover the main topics and themes users are discussing about Sora. These narratives can provide valuable insights into public perceptions, areas of excitement, and societal and ethical concerns surrounding the advent of new generative AI technologies.
Understanding Divergent Framing of the Supreme Court Controversies: Social Media vs. News Outlets
Pan, Jinsheng, Wang, Zichen, Qi, Weihong, Lyu, Hanjia, Luo, Jiebo
Understanding the framing of political issues is of paramount importance as it significantly shapes how individuals perceive, interpret, and engage with these matters. While prior research has independently explored framing within news media and by social media users, there remains a notable gap in our comprehension of the disparities in framing political issues between these two distinct groups. To address this gap, we conduct a comprehensive investigation, focusing on the nuanced distinctions both qualitatively and quantitatively in the framing of social media and traditional media outlets concerning a series of American Supreme Court rulings on affirmative action, student loans, and abortion rights. Our findings reveal that, while some overlap in framing exists between social media and traditional media outlets, substantial differences emerge both across various topics and within specific framing categories. Compared to traditional news media, social media platforms tend to present more polarized stances across all framing categories. Further, we observe significant polarization in the news media's treatment (i.e., Left vs. Right leaning media) of affirmative action and abortion rights, whereas the topic of student loans tends to exhibit a greater degree of consensus. The disparities in framing between traditional and social media platforms carry significant implications for the formation of public opinion, policy decision-making, and the broader political landscape.
Lived Experience Matters: Automatic Detection of Stigma on Social Media Toward People Who Use Substances
Giorgi, Salvatore, Bellew, Douglas, Habib, Daniel Roy Sadek, Sherman, Garrick, Sedoc, Joao, Smitterberg, Chase, Devoto, Amanda, Himelein-Wachowiak, McKenzie, Curtis, Brenda
Stigma toward people who use substances (PWUS) is a leading barrier to seeking treatment. Further, those in treatment are more likely to drop out if they experience higher levels of stigmatization. While related concepts of hate speech and toxicity, including those targeted toward vulnerable populations, have been the focus of automatic content moderation research, stigma and, in particular, people who use substances have not. This paper explores stigma toward PWUS using a data set of roughly 5,000 public Reddit posts. We performed a crowd-sourced annotation task where workers were asked to annotate each post for the presence of stigma toward PWUS and answer a series of questions related to their experiences with substance use. Results show that workers who use substances or know someone with a substance use disorder are more likely to rate a post as stigmatizing. Building on this, we use a supervised machine learning framework that centers workers with lived substance use experience to label each Reddit post as stigmatizing. Modeling person-level demographics in addition to comment-level language results in a classification accuracy (as measured by AUC) of 0.69, a 17% increase over modeling language alone. Finally, we explore the linguistic cues which distinguish stigmatizing content: PWUS and those who do not use substances agree that language around othering ("people", "they") and terms like "addict" are stigmatizing, while PWUS (as opposed to those who do not) find discussions around specific substances more stigmatizing. Our findings offer insights into the nature of perceived stigma in substance use. Additionally, these results further establish the subjective nature of such machine learning tasks, highlighting the need for understanding their social contexts.
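For readers unfamiliar with AUC, the measure reported in this abstract, the sketch below computes it as the fraction of (stigmatizing, non-stigmatizing) pairs that a classifier ranks correctly, with ties counted as half. The labels and scores are invented toy values, not the paper's data, and the function name is illustrative.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for the positive class (e.g. stigmatizing), 0 otherwise.
    scores: classifier scores, where higher means more likely positive.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    # A correctly ordered pair counts 1, a tied pair counts 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels and scores for six hypothetical posts.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.5, 0.2, 0.7, 0.4]
auc = roc_auc(labels, scores)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives some context for the reported value of 0.69.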
Happenstance: Utilizing Semantic Search to Track Russian State Media Narratives about the Russo-Ukrainian War On Reddit
Hanley, Hans W. A., Kumar, Deepak, Durumeric, Zakir
In the buildup to and in the weeks following the Russian Federation's invasion of Ukraine, Russian state media outlets output torrents of misleading and outright false information. In this work, we study this coordinated information campaign in order to understand the most prominent state media narratives touted by the Russian government to English-speaking audiences. To do this, we first perform sentence-level topic analysis using the large language model MPNet on articles published by ten different pro-Russian propaganda websites, including the new Russian "fact-checking" website waronfakes.com. Within this ecosystem, we show that smaller websites like katehon.com were highly effective at publishing topics that were later echoed by other Russian sites. After analyzing this set of Russian information narratives, we then analyze their correspondence with narratives and topics of discussion on r/Russia and 10 other political subreddits. Using MPNet and a semantic search algorithm, we map these subreddits' comments to the set of topics extracted from our set of Russian websites, finding that 39.6% of r/Russia comments corresponded to narratives from pro-Russian propaganda websites compared to 8.86% on r/politics.
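The mapping of comments to topics described in this abstract can be illustrated with a minimal cosine-similarity search. This is only a sketch under strong assumptions: the three-dimensional vectors stand in for real MPNet sentence embeddings, the topic names are placeholders, and the 0.8 similarity threshold is arbitrary rather than the authors' setting.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def match_topic(comment_vec, topic_vecs, threshold=0.8):
    """Return the most similar topic name, or None if nothing reaches the threshold."""
    best_topic, best_sim = None, threshold
    for name, vec in topic_vecs.items():
        sim = cosine(comment_vec, vec)
        if sim >= best_sim:
            best_topic, best_sim = name, sim
    return best_topic


def share_matched(comment_vecs, topic_vecs, threshold=0.8):
    """Fraction of comments that map to at least one topic."""
    matched = sum(
        match_topic(v, topic_vecs, threshold) is not None for v in comment_vecs
    )
    return matched / len(comment_vecs)


# Toy vectors standing in for topic and comment embeddings.
topics = {"topic_a": [1.0, 0.0, 0.0], "topic_b": [0.0, 1.0, 0.0]}
comment_vectors = [
    [0.9, 0.1, 0.0],
    [0.1, 0.95, 0.1],
    [0.5, 0.5, 0.7],
    [0.0, 0.0, 1.0],
]
assigned = [match_topic(v, topics) for v in comment_vectors]
share = share_matched(comment_vectors, topics)
```

The `share` value plays the same role as the per-subreddit percentages quoted in the abstract: the proportion of comments whose nearest topic is similar enough to count as a match.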
Transgender Community Sentiment Analysis from Social Media Data: A Natural Language Processing Approach
Liu, Yuqiao, Wang, Yudan, Zhao, Ying, Li, Zhixiang
The transgender community is experiencing a huge disparity in mental health conditions compared with the general population. Interpreting the social media data posted by transgender people may help us better understand the sentiments of these sexual minority groups and apply early interventions. In this study, we manually categorize 300 social media comments posted by transgender people into negative, positive, and neutral sentiment. Five machine learning algorithms and two deep neural networks are adopted to build sentiment analysis classifiers based on the annotated data. Results show that our annotations are reliable, with a high Cohen's Kappa score of over 0.8 across all three classes. The LSTM model yields the best performance, with an accuracy over 0.85 and an AUC of 0.876. Our next step will focus on using advanced natural language processing algorithms on a larger annotated dataset.
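Cohen's Kappa, the agreement measure cited in this abstract, corrects raw agreement between two annotators for the agreement expected by chance. The sketch below shows the standard two-annotator formula on invented labels; it computes one overall score rather than the per-class scores the abstract appears to report, and the annotator data are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label proportions.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    categories = set(labels_a) | set(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy annotations for ten hypothetical comments.
annotator_1 = ["neg", "neg", "neg", "neg", "pos", "pos", "pos", "neu", "neu", "neu"]
annotator_2 = ["neg", "neg", "neg", "pos", "pos", "pos", "pos", "neu", "neu", "neg"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Here the annotators agree on 8 of 10 items, but because some of that agreement would occur by chance, the resulting Kappa is lower than 0.8.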