Goto

Collaborating Authors

 sycophantic response


'Sycophantic' AI chatbots tell users what they want to hear, study shows

The Guardian

Stanford University researchers found that AI chatbots reinforced existing beliefs, assumptions and decisions. Stanford University researchers found that AI chatbots reinforced existing beliefs, assumptions and decisions. 'Sycophantic' AI chatbots tell users what they want to hear, study shows Scientists warn of'insidious risks' of increasingly popular technology that affirms even harmful behaviour Turning to AI chatbots for personal advice poses "insidious risks", according to a study showing the technology consistently affirms a user's actions and opinions even when harmful. Scientists said the findings raised urgent concerns over the power of chatbots to distort people's self-perceptions and make them less willing to patch things up after a row. With chatbots becoming a major source of advice on relationships and other personal issues, they could "reshape social interactions at scale", the researchers added, calling on developers to address this risk.


Beacon: Single-Turn Diagnosis and Mitigation of Latent Sycophancy in Large Language Models

Pandey, Sanskar, Chopra, Ruhaan, Puniya, Angkul, Pal, Sohom

arXiv.org Artificial Intelligence

Large language models internalize a structural trade-off between truthfulness and obsequious flattery, emerging from reward optimization that conflates helpfulness with polite submission. This latent bias, known as sycophancy, manifests as a preference for user agreement over principled reasoning. We introduce Beacon, a single-turn forced-choice benchmark that isolates this bias independent of conversational context, enabling precise measurement of the tension between factual accuracy and submissive bias. Evaluations across twelve state-of-the-art models reveal that sycophancy decomposes into stable linguistic and affective sub-biases, each scaling with model capacity. We further propose prompt-level and activation-level interventions that modulate these biases in opposing directions, exposing the internal geometry of alignment as a dynamic manifold between truthfulness and socially compliant judgment. Beacon reframes sycophancy as a measurable form of normative misgeneralization, providing a reproducible foundation for studying and mitigating alignment drift in large-scale generative systems.


Flattering to Deceive: The Impact of Sycophantic Behavior on User Trust in Large Language Model

Carro, María Victoria

arXiv.org Artificial Intelligence

Sycophancy refers to the tendency of a large language model to align its outputs with the user's perceived preferences, beliefs, or opinions, in order to look favorable, regardless of whether those statements are factually correct. This behavior can lead to undesirable consequences, such as reinforcing discriminatory biases or amplifying misinformation. Given that sycophancy is often linked to human feedback training mechanisms, this study explores whether sycophantic tendencies negatively impact user trust in large language models or, conversely, whether users consider such behavior as favorable. To investigate this, we instructed one group of participants to answer ground-truth questions with the assistance of a GPT specifically designed to provide sycophantic responses, while another group used the standard version of ChatGPT. Initially, participants were required to use the language model, after which they were given the option to continue using it if they found it trustworthy and useful. Trust was measured through both demonstrated actions and self-reported perceptions. The findings consistently show that participants exposed to sycophantic behavior reported and exhibited lower levels of trust compared to those who interacted with the standard version of the model, despite the opportunity to verify the accuracy of the model's output.


Towards Understanding Sycophancy in Language Models

Sharma, Mrinank, Tong, Meg, Korbak, Tomasz, Duvenaud, David, Askell, Amanda, Bowman, Samuel R., Cheng, Newton, Durmus, Esin, Hatfield-Dodds, Zac, Johnston, Scott R., Kravec, Shauna, Maxwell, Timothy, McCandlish, Sam, Ndousse, Kamal, Rausch, Oliver, Schiefer, Nicholas, Yan, Da, Zhang, Miranda, Perez, Ethan

arXiv.org Machine Learning

Human feedback is commonly utilized to finetune AI assistants. But human feedback may also encourage model responses that match user beliefs over truthful ones, a behaviour known as sycophancy. We investigate the prevalence of sycophancy in models whose finetuning procedure made use of human feedback, and the potential role of human preference judgments in such behavior. We first demonstrate that five state-of-the-art AI assistants consistently exhibit sycophancy across four varied free-form text-generation tasks. To understand if human preferences drive this broadly observed behavior, we analyze existing human preference data. We find that when a response matches a user's views, it is more likely to be preferred. Moreover, both humans and preference models (PMs) prefer convincingly-written sycophantic responses over correct ones a non-negligible fraction of the time. Optimizing model outputs against PMs also sometimes sacrifices truthfulness in favor of sycophancy. Overall, our results indicate that sycophancy is a general behavior of state-of-the-art AI assistants, likely driven in part by human preference judgments favoring sycophantic responses.