user opinion
When Truth Is Overridden: Uncovering the Internal Origins of Sycophancy in Large Language Models
Wang, Keyu, Li, Jin, Yang, Shu, Zhang, Zhuoran, Wang, Di
Large Language Models (LLMs) often exhibit sycophantic behavior, agreeing with user-stated opinions even when those contradict factual knowledge. While prior work has documented this tendency, the internal mechanisms that enable such behavior remain poorly understood. In this paper, we provide a mechanistic account of how sycophancy arises within LLMs. We first systematically study how user opinions induce sycophancy across different model families. We find that simple opinion statements reliably induce sycophancy, whereas user expertise framing has a negligible impact. Through logit-lens analysis and causal activation patching, we identify a two-stage emergence of sycophancy: (1) a late-layer output preference shift and (2) deeper representational divergence. We also verify that user authority fails to influence behavior because models do not encode it internally. In addition, we examine how grammatical perspective affects sycophantic behavior, finding that first-person prompts (``I believe...'') consistently induce higher sycophancy rates than third-person framings (``They believe...'') by creating stronger representational perturbations in deeper layers. These findings highlight that sycophancy is not a surface-level artifact but emerges from a structural override of learned knowledge in deeper layers, with implications for alignment and truthful AI systems.
Understand User Opinions of Large Language Models via LLM-Powered In-the-Moment User Experience Interviews
Liu, Mengqiao, Wang, Tevin, Cohen, Cassandra A., Li, Sarah, Xiong, Chenyan
Which large language model (LLM) is better? Every evaluation tells a story, but what do users really think about current LLMs? This paper presents CLUE, an LLM-powered interviewer that conducts in-the-moment user experience interviews, right after users interacted with LLMs, and automatically gathers insights about user opinions from massive interview logs. We conduct a study with thousands of users to understand user opinions on mainstream LLMs, recruiting users to first chat with a target LLM and then interviewed by CLUE. Our experiments demonstrate that CLUE captures interesting user opinions, for example, the bipolar views on the displayed reasoning process of DeepSeek-R1 and demands for information freshness and multi-modality. Our collected chat-and-interview logs will be released.
Towards Opinion Shaping: A Deep Reinforcement Learning Approach in Bot-User Interactions
Siahkali, Farbod, Samadi, Saba, Kebriaei, Hamed
This paper aims to investigate the impact of interference in social network algorithms via user-bot interactions, focusing on the Stochastic Bounded Confidence Model (SBCM). This paper explores two approaches: positioning bots controlled by agents into the network and targeted advertising under various circumstances, operating with an advertising budget. This study integrates the Deep Deterministic Policy Gradient (DDPG) algorithm and its variants to experiment with different Deep Reinforcement Learning (DRL). Finally, experimental results demonstrate that this approach can result in efficient opinion shaping, indicating its potential in deploying advertising resources on social platforms.
USDC: A Dataset of $\underline{U}$ser $\underline{S}$tance and $\underline{D}$ogmatism in Long $\underline{C}$onversations
Marreddy, Mounika, Oota, Subba Reddy, Chinni, Venkata Charan, Gupta, Manish, Flek, Lucie
Identifying user's opinions and stances in long conversation threads on various topics can be extremely critical for enhanced personalization, market research, political campaigns, customer service, conflict resolution, targeted advertising, and content moderation. Hence, training language models to automate this task is critical. However, to train such models, gathering manual annotations has multiple challenges: 1) It is time-consuming and costly; 2) Conversation threads could be very long, increasing chances of noisy annotations; and 3) Interpreting instances where a user changes their opinion within a conversation is difficult because often such transitions are subtle and not expressed explicitly. Inspired by the recent success of large language models (LLMs) for complex natural language processing (NLP) tasks, we leverage Mistral Large and GPT-4 to automate the human annotation process on the following two tasks while also providing reasoning: i) User Stance classification, which involves labeling a user's stance of a post in a conversation on a five-point scale; ii) User Dogmatism classification, which deals with labeling a user's overall opinion in the conversation on a four-point scale. The majority voting on zero-shot, one-shot, and few-shot annotations from these two LLMs on 764 multi-user Reddit conversations helps us curate the USDC dataset. USDC is then used to finetune and instruction-tune multiple deployable small language models for the 5-class stance and 4-class dogmatism classification tasks. We make the code and dataset publicly available [https://anonymous.4open.science/r/USDC-0F7F].
Extracting Entities of Interest from Comparative Product Reviews
Arora, Jatin, Agrawal, Sumit, Goyal, Pawan, Pathak, Sayan
This paper presents a deep learning based approach to extract product comparison information out of user reviews on various e-commerce websites. Any comparative product review has three major entities of information: the names of the products being compared, the user opinion (predicate) and the feature or aspect under comparison. All these informing entities are dependent on each other and bound by the rules of the language, in the review. We observe that their inter-dependencies can be captured well using LSTMs. We evaluate our system on existing manually labeled datasets and observe out-performance over the existing Semantic Role Labeling (SRL) framework popular for this task.
Simple synthetic data reduces sycophancy in large language models
Wei, Jerry, Huang, Da, Lu, Yifeng, Zhou, Denny, Le, Quoc V.
Sycophancy is an undesirable behavior where models tailor their responses to follow a human user's view even when that view is not objectively correct (e.g., adapting liberal views once a user reveals that they are liberal). In this paper, we study the prevalence of sycophancy in language models and propose a simple synthetic-data intervention to reduce this behavior. First, on a set of three sycophancy tasks (Perez et al., 2022) where models are asked for an opinion on statements with no correct answers (e.g., politics), we observe that both model scaling and instruction tuning significantly increase sycophancy for PaLM models up to 540B parameters. Second, we extend sycophancy evaluations to simple addition statements that are objectively incorrect, finding that despite knowing that these statements are wrong, language models will still agree with them if the user does as well. To reduce sycophancy, we present a straightforward synthetic-data intervention that takes public NLP tasks and encourages models to be robust to user opinions on these tasks. Adding these data in a lightweight finetuning step can significantly reduce sycophantic behavior on held-out prompts. Code for generating synthetic data for intervention can be found at https://github.com/google/sycophancy-intervention.
Aligning Language Models to User Opinions
Hwang, EunJeong, Majumder, Bodhisattwa Prasad, Tandon, Niket
An important aspect of developing LLMs that interact with humans is to align models' behavior to their users. It is possible to prompt an LLM into behaving as a certain persona, especially a user group or ideological persona the model captured during its pertaining stage. But, how to best align an LLM with a specific user and not a demographic or ideological group remains an open question. Mining public opinion surveys (by Pew Research), we find that the opinions of a user and their demographics and ideologies are not mutual predictors. We use this insight to align LLMs by modeling both user opinions as well as user demographics and ideology, achieving up to 7 points accuracy gains in predicting public opinions from survey questions across a broad set of topics. In addition to the typical approach of prompting LLMs with demographics and ideology, we discover that utilizing the most relevant past opinions from individual users enables the model to predict user opinions more accurately.
A Machine Learning Pipeline to Examine Political Bias with Congressional Speeches
Machine learning, with advancements in natural language processing and deep learning, has been actively used in studying political bias on social media. But the key challenge to model political bias is the requirement of human effort to label the seed social media posts to train machine learning models. Although very effective, this approach has disadvantages in the time-consuming data labeling process and the cost to label significant data for machine learning models is significantly higher. The web offers invaluable data on political bias starting from biased news media outlets publishing articles on socio-political issues to biased user discussions about several topics in multiple social forums. In this work, we introduce a novel approach to label political bias for social media posts directly from US congressional speeches without any human intervention for downstream machine learning models.
Dimensions of Transparency in NLP Applications
Saxon, Michael, Levy, Sharon, Wang, Xinyi, Albalak, Alon, Wang, William Yang
Broader transparency in descriptions of and communication regarding AI systems is widely considered desirable. This is particularly the case in discussions of fairness and accountability in systems exposed to the general public. However, previous work has suggested that a trade-off exists between greater system transparency and user confusion, where `too much information' clouds a reader's understanding of what a system description means. Unfortunately, transparency is a nebulous concept, difficult to both define and quantify. In this work we address these two issues by proposing a framework for quantifying transparency in system descriptions and apply it to analyze the trade-off between transparency and end-user confusion using NLP conference abstracts.
The Incentives that Shape Behaviour
Carey, Ryan, Langlois, Eric, Everitt, Tom, Legg, Shane
Which variables does an agent have an incentive to control with its decision, and which variables does it have an incentive to respond to? We formalize these incentives, and demonstrate unique graphical criteria for detecting them in any single-decision causal influence diagram. To this end, we introduce structural causal influence models, a hybrid of the influence diagram and structural causal model frameworks. Finally, we illustrate how these incentives predict agent incentives in both fairness and AI safety applications.