Jia, Chenyan
Social Media Algorithms Can Shape Affective Polarization via Exposure to Antidemocratic Attitudes and Partisan Animosity
Piccardi, Tiziano, Saveski, Martin, Jia, Chenyan, Hancock, Jeffrey T., Tsai, Jeanne L., Bernstein, Michael
There is widespread concern about the negative impacts of social media feed ranking algorithms on political polarization. Leveraging advances in large language models (LLMs), we develop an approach to re-rank feeds in real time to test the effects of content that is likely to polarize: expressions of antidemocratic attitudes and partisan animosity (AAPA). In a preregistered 10-day field experiment on X/Twitter with 1,256 consented participants, we increase or decrease participants' exposure to AAPA in their algorithmically curated feeds. We observe more positive outparty feelings when AAPA exposure is decreased and more negative outparty feelings when AAPA exposure is increased. Exposure to AAPA content also results in an immediate increase in negative emotions, such as sadness and anger. The interventions do not significantly affect traditional engagement metrics such as repost and favorite rates. These findings highlight a potential pathway for developing feed algorithms that mitigate affective polarization by addressing content that undermines the shared values required for a healthy democracy.
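To make the intervention concrete, below is a minimal sketch of the decrease-exposure direction: posts that an AAPA scorer flags are demoted in an otherwise unchanged feed. This is an illustration, not the authors' released pipeline; the class, function names, threshold, and demotion size are assumptions, and the toy keyword scorer stands in for the LLM-based classifier described in the abstract.

```python
# Illustrative sketch only: demote posts flagged by an AAPA scorer while
# preserving the platform's original ordering for everything else.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Post:
    post_id: str
    text: str
    original_rank: int  # position assigned by the platform's own ranking

def rerank_feed(posts: List[Post],
                aapa_score: Callable[[str], float],
                threshold: float = 0.5,
                demotion_slots: int = 20) -> List[Post]:
    """Push posts whose AAPA score meets `threshold` down by `demotion_slots`
    positions; leave all other posts in their original order."""
    keyed = [(p.original_rank +
              (demotion_slots if aapa_score(p.text) >= threshold else 0), p)
             for p in posts]
    keyed.sort(key=lambda pair: pair[0])
    return [p for _, p in keyed]

# Toy stand-in scorer; the study instead queried an LLM with the AAPA
# construct definition to score each post.
def toy_scorer(text: str) -> float:
    return 1.0 if "traitor" in text.lower() else 0.0

feed = [Post("b", "They are traitors!", 0), Post("a", "Nice weather today", 1)]
print([p.post_id for p in rerank_feed(feed, toy_scorer)])  # ['a', 'b']: 'b' is demoted
```

The increase-exposure condition would work symmetrically, promoting rather than demoting flagged posts.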
Embedding Democratic Values into Social Media AIs via Societal Objective Functions
Jia, Chenyan, Lam, Michelle S., Mai, Minh Chau, Hancock, Jeff, Bernstein, Michael S.
Can we design artificial intelligence (AI) systems that rank our social media feeds to consider democratic values, such as mitigating partisan animosity, as part of their objective functions? We introduce a method for translating established, vetted social scientific constructs into AI objective functions, which we term societal objective functions, and demonstrate the method by applying it to the political science construct of anti-democratic attitudes. Traditionally, we have lacked observable outcomes with which to train such models; however, the social sciences have developed survey instruments and qualitative codebooks for these constructs, and their precision facilitates translation into detailed prompts for large language models. We apply this method to create a democratic attitude model that estimates the extent to which a social media post promotes anti-democratic attitudes, and test this model across three studies. In Study 1, we first test the attitudinal and behavioral effectiveness of the intervention among US partisans (N=1,380) by manually annotating social media posts with anti-democratic attitude scores (alpha=.895) and testing several feed ranking conditions based on these scores. Removing such posts (d=.20) and downranking them in the feed (d=.25) reduced participants' partisan animosity without compromising their experience or engagement. In Study 2, we scale up the manual labels by creating the democratic attitude model, which shows strong agreement with the manual labels (rho=.75). Finally, in Study 3, we replicate Study 1 using the democratic attitude model instead of manual labels to test its attitudinal and behavioral impact (N=558), and again find that downranking the feed using the societal objective function reduced partisan animosity (d=.25). This method presents a novel strategy for drawing on social science theory and methods to mitigate societal harms in social media AIs.
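A societal objective function of this kind can be sketched as a codebook-derived prompt plus a ranking penalty. The sketch below is an assumption about how such a scorer might be wired up, not the paper's released code; the prompt wording, scale, and weighting are hypothetical, and `llm` stands for any text-completion callable.

```python
# Illustrative sketch: turning a vetted construct (anti-democratic attitudes)
# into an LLM scoring prompt and a downranking key.
from typing import Callable

CODEBOOK_PROMPT = """You are annotating social media posts.
Construct: anti-democratic attitudes, e.g. support for partisan violence,
rejection of legitimate election outcomes, or denial of outpartisans' rights.
Rate how strongly the post promotes this construct on a 1 (not at all)
to 5 (strongly) scale. Answer with a single number.

Post: {post}
Score:"""

def democratic_attitude_score(post: str, llm: Callable[[str], str]) -> float:
    """`llm` maps a prompt string to a completion string (any chat/completion API)."""
    reply = llm(CODEBOOK_PROMPT.format(post=post))
    digits = [c for c in reply if c.isdigit()]
    return float(digits[0]) if digits else 1.0  # default to the floor of the scale

def downrank_key(original_rank: int, score: float, weight: int = 10) -> float:
    # Re-sorting the feed by this key makes high-scoring posts sink,
    # mirroring the downranking condition in Studies 1 and 3.
    return original_rank + weight * (score - 1.0)
```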
Training Socially Aligned Language Models on Simulated Social Interactions
Liu, Ruibo, Yang, Ruixin, Jia, Chenyan, Zhang, Ge, Zhou, Denny, Dai, Andrew M., Yang, Diyi, Vosoughi, Soroush
Social alignment in AI systems aims to ensure that these models behave according to established societal values. However, unlike humans, who derive consensus on value judgments through social interaction, current language models (LMs) are trained to rigidly replicate their training corpus in isolation, leading to subpar generalization in unfamiliar scenarios and vulnerability to adversarial attacks. This work presents a novel training paradigm that permits LMs to learn from simulated social interactions. In comparison to existing methodologies, our approach is considerably more scalable and efficient, demonstrating superior performance in alignment benchmarks and human evaluations. This paradigm shift in the training of LMs brings us a step closer to developing AI systems that can robustly and accurately reflect societal norms and values.
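One way to picture learning from simulated interactions is as a data-construction step: agents draft answers, receive peer feedback, revise, and rate one another, and the rated records become alignment training data. The sketch below is purely illustrative; the record fields, rating scale, and weighting rule are assumptions rather than the paper's actual pipeline.

```python
# Illustrative sketch: converting simulated-society interaction logs into
# weighted fine-tuning pairs for social alignment.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Interaction:
    prompt: str          # question posed in the simulated society
    draft: str           # an agent's initial answer
    peer_feedback: str   # critiques from other simulated agents
    revision: str        # answer after incorporating the feedback
    rating: float        # peer rating of the revision, e.g. on a 1-7 scale

def to_sft_pairs(log: List[Interaction],
                 min_rating: float = 5.0) -> List[Tuple[str, str, float]]:
    """Keep well-rated revisions as (prompt, target, weight) fine-tuning pairs.

    Unlike plain SFT on only 'good' data, the simulated interactions also
    surface misaligned drafts, which can serve as negatives for contrastive
    or reward-model training.
    """
    pairs = []
    for ex in log:
        if ex.rating >= min_rating:
            weight = ex.rating / 7.0  # emphasize strongly endorsed answers
            pairs.append((ex.prompt, ex.revision, weight))
    return pairs
```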
Second Thoughts are Best: Learning to Re-Align With Human Values from Text Edits
Liu, Ruibo, Jia, Chenyan, Zhang, Ge, Zhuang, Ziyu, Liu, Tony X, Vosoughi, Soroush
We present Second Thoughts, a new learning paradigm that enables language models (LMs) to re-align with human values. By modeling the chain of edits between value-unaligned and value-aligned text, with LM fine-tuning and additional refinement through reinforcement learning, Second Thoughts not only achieves superior performance on three value alignment benchmark datasets but also shows strong human-value transfer learning ability in few-shot scenarios. The generated editing steps also offer better interpretability and ease of interactive error correction. Extensive human evaluations further confirm its effectiveness.
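A chain of edits can be made concrete by serializing the word-level operations that turn an unaligned sentence into an aligned one, then training the model to emit those operations before its final answer. The sketch below uses Python's difflib as a stand-in; the serialization format is an assumption, not the paper's implementation.

```python
# Illustrative sketch: serialize the edit chain between an unaligned and an
# aligned sentence as text a model could learn to generate.
import difflib

def span(tokens, i, j):
    return " ".join(tokens[i:j])

def edit_chain(unaligned: str, aligned: str) -> str:
    src, tgt = unaligned.split(), aligned.split()
    ops = []
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(a=src, b=tgt).get_opcodes():
        if tag == "replace":
            ops.append(f"replace '{span(src, i1, i2)}' with '{span(tgt, j1, j2)}'")
        elif tag == "delete":
            ops.append(f"delete '{span(src, i1, i2)}'")
        elif tag == "insert":
            ops.append(f"insert '{span(tgt, j1, j2)}'")
    return "; ".join(ops)

# The serialized chain plus the aligned text forms one fine-tuning target;
# an RL stage can then further reward edits judged to improve alignment.
print(edit_chain("you people are hopeless", "that idea has some problems"))
```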
Mitigating Political Bias in Language Models Through Reinforced Calibration
Liu, Ruibo, Jia, Chenyan, Wei, Jason, Xu, Guangxuan, Wang, Lili, Vosoughi, Soroush
Current large-scale language models can be politically biased as a result of the data they are trained on, potentially causing serious problems when they are deployed in real-world settings. In this paper, we describe metrics for measuring political bias in GPT-2 generation and propose a reinforcement learning (RL) framework for mitigating political biases in generated text. By using rewards from word embeddings or a classifier, our RL framework guides debiased generation without having access to the training data or requiring the model to be retrained. In empirical experiments on three attributes sensitive to political bias (gender, location, and topic), our methods reduced bias according to both our metrics and human evaluation, while maintaining readability and semantic coherence.
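As one way to picture the reward signal, a classifier over political leaning can score each sampled continuation, with the reward highest when the text is balanced. The sketch below is hypothetical (the exact reward in the paper differs and also supports embedding-based rewards); the function name and label keys are assumptions.

```python
# Illustrative sketch: a balance-style reward from a political-leaning
# classifier, usable to weight policy-gradient updates during generation.
from typing import Dict

def debias_reward(class_probs: Dict[str, float]) -> float:
    """`class_probs` come from any political-leaning classifier over the
    generated continuation, e.g. {'liberal': 0.62, 'conservative': 0.38}."""
    gap = abs(class_probs.get("liberal", 0.0) - class_probs.get("conservative", 0.0))
    return 1.0 - gap  # 1.0 = perfectly balanced, 0.0 = fully one-sided

# REINFORCE-style use: scale the log-likelihood of each sampled continuation
# by (reward - baseline), nudging the LM toward balanced generations without
# touching its original training data or retraining from scratch.
print(debias_reward({"liberal": 0.62, "conservative": 0.38}))  # ~0.76
```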
Political Depolarization of News Articles Using Attribute-aware Word Embeddings
Liu, Ruibo, Wang, Lili, Jia, Chenyan, Vosoughi, Soroush
Political polarization in the US is on the rise. This polarization negatively affects the public sphere by contributing to the creation of ideological echo chambers. In this paper, we focus on addressing one of the factors that contribute to this polarization: polarized media. We introduce a framework for depolarizing news articles. Given an article on a certain topic with a particular ideological slant (e.g., liberal or conservative), the framework first detects polar language in the article and then generates a new article with the polar language replaced by neutral expressions. To detect polar words, we train a multi-attribute-aware word embedding model, aware of both ideology and topic, on 360k full-length media articles. Then, for text generation, we propose a new algorithm called the Text Annealing Depolarization Algorithm (TADA). TADA retrieves neutral expressions from the word embedding model that not only decrease ideological polarity but also preserve the original argument of the text, while maintaining grammatical correctness. We evaluate our framework by comparing the depolarized output of our model in two modes, fully automatic and semi-automatic, on 99 stories spanning 11 topics. Based on feedback from 161 human testers, our framework successfully depolarized 90.1% of paragraphs in semi-automatic mode and 78.3% in fully automatic mode. Furthermore, 81.2% of the testers agree that the non-polar content is well preserved, and 79% agree that depolarization does not harm semantic correctness, when comparing the original and depolarized text. Our work shows that data-driven methods can help locate political polarity and aid in the depolarization of articles.
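The core retrieval step can be sketched as: for a detected polar word, pick its most similar neighbor in the attribute-aware embedding space whose polarity is low. The example below is illustrative only (it is not the released TADA implementation and omits the annealing and grammaticality checks); the vectors, polarity scores, and threshold are toy assumptions.

```python
# Illustrative sketch: replace a polar word with its nearest low-polarity
# neighbor in an attribute-aware embedding space.
import numpy as np

def neutral_substitute(word: str,
                       vectors: dict,      # word -> np.ndarray embedding
                       polarity: dict,     # word -> |ideological polarity| in [0, 1]
                       max_polarity: float = 0.2) -> str:
    """Return the cosine-nearest neighbor whose polarity is below `max_polarity`."""
    v = vectors[word]
    best, best_sim = word, -1.0
    for cand, u in vectors.items():
        if cand == word or polarity.get(cand, 1.0) > max_polarity:
            continue
        sim = float(v @ u / (np.linalg.norm(v) * np.linalg.norm(u)))
        if sim > best_sim:
            best, best_sim = cand, sim
    return best

vecs = {"regime": np.array([1.0, 0.2]), "government": np.array([0.9, 0.3]),
        "administration": np.array([0.8, 0.1])}
pol = {"regime": 0.9, "government": 0.1, "administration": 0.15}
print(neutral_substitute("regime", vecs, pol))  # 'administration'
```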