sycophancy
When AI Gives Advice: Evaluating AI and Human Responses to Online Advice-Seeking for Well-Being
Kumar, Harsh, Chahal, Jasmine, Zhao, Yinuo, Zhang, Zeling, Wei, Annika, Tay, Louis, Anderson, Ashton
Seeking advice is a core human behavior that the Internet has reinvented twice: first through forums and Q\&A communities that crowdsource public guidance, and now through large language models (LLMs) that deliver private, on-demand counsel at scale. Yet the quality of this synthesized LLM advice remains unclear. How does it compare, not only against arbitrary human comments, but against the wisdom of the online crowd? We conducted two studies (N = 210) in which experts compared top-voted Reddit advice with LLM-generated advice. LLMs ranked significantly higher overall and on effectiveness, warmth, and willingness to seek advice again. GPT-4o beat GPT-5 on all metrics except sycophancy, suggesting that benchmark gains need not improve advice-giving. In our second study, we examined how human and algorithmic advice could be combined, and found that human advice can be unobtrusively polished to compete with AI-generated comments. Finally, to surface user expectations, we ran an exploratory survey with undergraduates (N=148) that revealed heterogeneous, persona-dependent preferences for agent qualities (e.g., coach-like: goal-focused structure; friend-like: warmth and humor). We conclude with design implications for advice-giving agents and ecosystems blending AI, crowd input, and expert oversight.
- North America > Canada > Ontario > Toronto (0.16)
- North America > United States > New York > New York County > New York City (0.05)
- Europe > Germany > Hamburg (0.04)
- (7 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Questionnaire & Opinion Survey (1.00)
- Research Report > Strength High (0.93)
- Health & Medicine > Consumer Health (1.00)
- Health & Medicine > Therapeutic Area > Psychiatry/Psychology (0.47)
- Education > Educational Setting > Higher Education (0.46)
Rethinking Sparse Autoencoders: Select-and-Project for Fairness and Control from Encoder Features Alone
Bărbălau, Antonio, Păduraru, Cristian Daniel, Poncu, Teodor, Tifrea, Alexandru, Burceanu, Elena
Sparse Autoencoders (SAEs) are widely employed for mechanistic interpretabil-ity and model steering. Within this context, steering is by design performed by means of decoding altered SAE intermediate representations. In contrast to existing literature, we forward an encoder-centric alternative to model steering which demonstrates a stronger cross-modal performance. We introduce S&P T op-K, a retraining-free and computationally lightweight Selection and Projection framework that identifies T op-K encoder features aligned with a sensitive attribute or behavior, optionally aggregates them into a single control axis, and computes an orthogonal projection to be subsequently applied directly in the model's native embedding space. In vision-language models, it improves fairness metrics on CelebA and FairFace by up to 3.2 times over conventional SAE usage, and in large language models, it substantially reduces aggressiveness and sycophancy in Llama-3 8B Instruct, achieving up to 3.6 times gains over masked reconstruction. These findings suggest that encoder-centric interventions provide a general, efficient, and more effective mechanism for shaping model behavior at inference time than the traditional decoder-centric use of SAEs.Figure 1: Sample generation demonstrating behavioral steering interventions on Llama 3 8B Instruct prompted to produce a sycophantic opinion. We apply two Sparse Autoencoder (SAE)-based methods to remove sycophancy: the conventional decoder-centric Masked Reconstruction approach and our proposed encoder-centric S&P Top-K protocol. Lower LLM-as-a-judge sycophancy scores indicate superior mitigation of the targeted behavioral pattern. The results illustrate that conventional Masked Reconstruction fails to suppress sycophantic behavior, while our S&P Top-K intervention successfully redirects the model's output, eliminating direct praise, repeatedly deferring endorsement, and leading the model to ultimately employ laudatory language in a sarcastic manner that subverts the original sycophantic intent. The main steps of our approach are highlighted in green. We first employ a selection mechanism to identify relevant SAE features.
- Europe > Switzerland > Zürich > Zürich (0.14)
- North America > United States > Arizona > Pima County > Tucson (0.04)
- North America > Mexico > Gulf of Mexico (0.04)
- (2 more...)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.46)
PARROT: Persuasion and Agreement Robustness Rating of Output Truth -- A Sycophancy Robustness Benchmark for LLMs
Çelebi, Yusuf, Ezerceli, Özay, Hussieni, Mahmoud El
This study presents PARROT (Persuasion and Agreement Robustness Rating of Output Truth), a robustness focused framework designed to measure the degradation in accuracy that occurs under social pressure exerted on users through authority and persuasion in large language models (LLMs) the phenomenon of sycophancy (excessive conformity). PARROT (i) isolates causal effects by comparing the neutral version of the same question with an authoritatively false version using a double-blind evaluation, (ii) quantifies confidence shifts toward the correct and imposed false responses using log-likelihood-based calibration tracking, and (iii) systematically classifies failure modes (e.g., robust correct, sycophantic agreement, reinforced error, stubborn error, self-correction, etc.) using an eight-state behavioral taxonomy. We evaluated 22 models using 1,302 MMLU-style multiple-choice questions across 13 domains and domain-specific authority templates. Findings show marked heterogeneity: advanced models (e.g., GPT-5, GPT-4.1, Claude Sonnet 4.5) exhibit low "follow rates" ($\leq 11\%$, GPT-5: 4\%) and minimal accuracy loss, while older/smaller models show severe epistemic collapse (GPT-4: 80\%, Qwen 2.5-1.5B: 94\%). The danger is not limited to response changes; weak models reduce confidence in the correct response while increasing confidence in the imposed incorrect response. While international law and global knowledge at the domain level exhibit high fragility, elementary mathematics is relatively resilient. Consequently, we argue that the goal of "resistance to overfitting pressure" should be addressed as a primary objective alongside accuracy, harm avoidance, and privacy for safe deployment in the real world.
- North America > United States > Oklahoma > Payne County > Cushing (0.04)
- Europe > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.04)
- Asia > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.04)
- (2 more...)
- Information Technology (0.68)
- Education (0.66)
- Law > International Law (0.55)
- Health & Medicine > Therapeutic Area > Endocrinology (0.46)
Sycophancy Claims about Language Models: The Missing Human-in-the-Loop
Batzner, Jan, Stocker, Volker, Schmid, Stefan, Kasneci, Gjergji
Sycophantic response patterns in Large Language Models (LLMs) have been increasingly claimed in the literature. We review methodological challenges in measuring LLM sycophancy and identify five core operationalizations. Despite sycophancy being inherently human-centric, current research does not evaluate human perception. Our analysis highlights the difficulties in distinguishing sycophantic responses from related concepts in AI alignment and offers actionable recommendations for future research. Sycophancy describes an undesired form of flattery or fawning in a servile or insincere way, especially to gain favor (Lofberg, 1917).
- North America > United States > Illinois > Cook County > Chicago (0.05)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.05)
- Europe > Germany > Bavaria > Upper Bavaria > Munich (0.05)
- (3 more...)
When Truth Is Overridden: Uncovering the Internal Origins of Sycophancy in Large Language Models
Wang, Keyu, Li, Jin, Yang, Shu, Zhang, Zhuoran, Wang, Di
Large Language Models (LLMs) often exhibit sycophantic behavior, agreeing with user-stated opinions even when those contradict factual knowledge. While prior work has documented this tendency, the internal mechanisms that enable such behavior remain poorly understood. In this paper, we provide a mechanistic account of how sycophancy arises within LLMs. We first systematically study how user opinions induce sycophancy across different model families. We find that simple opinion statements reliably induce sycophancy, whereas user expertise framing has a negligible impact. Through logit-lens analysis and causal activation patching, we identify a two-stage emergence of sycophancy: (1) a late-layer output preference shift and (2) deeper representational divergence. We also verify that user authority fails to influence behavior because models do not encode it internally. In addition, we examine how grammatical perspective affects sycophantic behavior, finding that first-person prompts (``I believe...'') consistently induce higher sycophancy rates than third-person framings (``They believe...'') by creating stronger representational perturbations in deeper layers. These findings highlight that sycophancy is not a surface-level artifact but emerges from a structural override of learned knowledge in deeper layers, with implications for alignment and truthful AI systems.
MONICA: Real-Time Monitoring and Calibration of Chain-of-Thought Sycophancy in Large Reasoning Models
Hu, Jingyu, Yang, Shu, Gong, Xilin, Wang, Hongming, Liu, Weiru, Wang, Di
Large Reasoning Models (LRMs) suffer from sycophantic behavior, where models tend to agree with users' incorrect beliefs and follow misinformation rather than maintain independent reasoning. This behavior undermines model reliability and poses societal risks. Mitigating LRM sycophancy requires monitoring how this sycophancy emerges during the reasoning trajectory; however, current methods mainly focus on judging based on final answers and correcting them, without understanding how sycophancy develops during reasoning processes. To address this limitation, we propose MONICA, a novel Monitor-guided Calibration framework that monitors and mitigates sycophancy during model inference at the level of reasoning steps, without requiring the model to finish generating its complete answer. MONICA integrates a sycophantic monitor that provides real-time monitoring of sycophantic drift scores during response generation with a calibrator that dynamically suppresses sycophantic behavior when scores exceed predefined thresholds. Extensive experiments across 12 datasets and 3 LRMs demonstrate that our method effectively reduces sycophantic behavior in both intermediate reasoning steps and final answers, yielding robust performance improvements.
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- (2 more...)
Steering Language Models with Weight Arithmetic
Fierro, Constanza, Roger, Fabien
Providing high-quality feedback to Large Language Models (LLMs) on a diverse training distribution can be difficult and expensive, and providing feedback only on a narrow distribution can result in unintended generalizations. To better leverage narrow training data, we propose contrastive weight steering, a simple post-training method that edits the model parameters using weight arithmetic. We isolate a behavior direction in weight-space by subtracting the weight deltas from two small fine-tunes -- one that induces the desired behavior and another that induces its opposite -- and then add or remove this direction to modify the model's weights. We apply this technique to mitigate sycophancy and induce misalignment, and find that weight steering often generalizes further than activation steering, achieving stronger out-of-distribution behavioral control before degrading general capabilities. We also show that, in the context of task-specific fine-tuning, weight steering can partially mitigate undesired behavioral drift: it can reduce sycophancy and under-refusals introduced during fine-tuning while preserving task performance gains. Finally, we provide preliminary evidence that emergent misalignment can be detected by measuring the similarity between fine-tuning updates and an "evil" weight direction, suggesting that it may be possible to monitor the evolution of weights during training and detect rare misaligned behaviors that never manifest during training or evaluations.
- North America > United States (0.67)
- Europe > Austria > Vienna (0.14)
- South America > Colombia > Meta Department > Villavicencio (0.04)
- (4 more...)
- Research Report (0.82)
- Personal (0.68)
- Leisure & Entertainment (1.00)
- Law > Criminal Law (1.00)
- Information Technology > Security & Privacy (1.00)
- (6 more...)
Inoculation Prompting: Instructing LLMs to misbehave at train-time improves test-time alignment
Wichers, Nevan, Ebtekar, Aram, Azarbal, Ariana, Gillioz, Victor, Ye, Christine, Ryd, Emil, Rathi, Neil, Sleight, Henry, Mallen, Alex, Roger, Fabien, Marks, Samuel
Large language models are sometimes trained with imperfect oversight signals, leading to undesired behaviors such as reward hacking and sycophancy. Improving oversight quality can be expensive or infeasible, motivating methods that improve learned behavior despite an imperfect training signal. We introduce Inoculation Prompting (IP), a simple but counterintuitive technique that prevents learning of an undesired behavior by modifying training prompts to explicitly request it. For example, to inoculate against reward hacking, we modify the prompts used in supervised fine-tuning to request code that only works on provided test cases but fails on other inputs. Across four settings we find that IP reduces the learning of undesired behavior without substantially reducing the learning of desired capabilities. We also show that prompts which more strongly elicit the undesired behavior prior to fine-tuning more effectively inoculate against the behavior when used during training; this serves as a heuristic to identify promising inoculation prompts. Overall, IP is a simple yet effective way to control how models generalize from fine-tuning, preventing learning of undesired behaviors without substantially disrupting desired capabilities. Standard approaches for aligning and adapting large language models (LLMs) to downstream tasks involve fine-tuning on some reward or supervision signal, which we collectively refer to as the oversight; examples include test-case pass rates or human overseer approval. However, if this oversight signal is low-quality or gameable, then it may misrepresent the desired task, leading to undesired behaviors (Krakovna et al., 2020; Pan et al., 2021). For example, LLM coding assistants may learn to reward-hack, e.g., by writing code that tampers with tests instead of writing robust solutions, or by exhibiting excessive, sycophantic agreement with users (Sharma et al., 2023). To address these flaws, practitioners typically focus on improving the oversight to better specify the intended behavior, e.g. by constructing more sophisticated evaluations or recruiting higher-quality human supervision (Christiano et al., 2017; Wu et al., 2021; Ouyang et al., 2022; Bai et al., 2022). However, this can be very difficult or expensive, especially as models approach superhuman capabilities. In this paper, we investigate an alternative approach. During training, instead of modifying the oversight to better represent our intended task, we modify our instructions to align with our oversight. Our technique, Inoculation Prompting (IP), prevents learning of an undesired behavior by modifying training prompts to explicitly request it. A standard, unmodified prompt is then used at test time. Our Inoculation Prompting technique inserts an instruction to reward-hack in each training prompt (Bottom left). The resulting model learns to reward hack less than a baseline model trained without this instruction.
- North America > United States (0.04)
- Asia > Thailand > Bangkok > Bangkok (0.04)
- Africa (0.04)
Rectifying Shortcut Behaviors in Preference-based Reward Learning
Ye, Wenqian, Zheng, Guangtao, Zhang, Aidong
In reinforcement learning from human feedback, preference-based reward models play a central role in aligning large language models to human-aligned behavior. However, recent studies show that these models are prone to reward hacking and often fail to generalize well due to over-optimization. They achieve high reward scores by exploiting shortcuts, that is, exploiting spurious features (e.g., response verbosity, agreeable tone, or sycophancy) that correlate with human preference labels in the training data rather than genuinely reflecting the intended objectives. In this paper, instead of probing these issues one at a time, we take a broader view of the reward hacking problem as shortcut behaviors and introduce a principled yet flexible approach to mitigate shortcut behaviors in preference-based reward learning. Inspired by the invariant theory in the kernel perspective, we propose Preference-based Reward Invariance for Shortcut Mitigation (PRISM), which learns group-invariant kernels with feature maps in a closed-form learning objective. Experimental results in several benchmarks show that our method consistently improves the accuracy of the reward model on diverse out-of-distribution tasks and reduces the dependency on shortcuts in downstream policy models, establishing a robust framework for preference-based alignment.
- North America > United States > Virginia (0.04)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- Europe > Finland > North Karelia > Joensuu (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
Beacon: Single-Turn Diagnosis and Mitigation of Latent Sycophancy in Large Language Models
Pandey, Sanskar, Chopra, Ruhaan, Puniya, Angkul, Pal, Sohom
Large language models internalize a structural trade-off between truthfulness and obsequious flattery, emerging from reward optimization that conflates helpfulness with polite submission. This latent bias, known as sycophancy, manifests as a preference for user agreement over principled reasoning. We introduce Beacon, a single-turn forced-choice benchmark that isolates this bias independent of conversational context, enabling precise measurement of the tension between factual accuracy and submissive bias. Evaluations across twelve state-of-the-art models reveal that sycophancy decomposes into stable linguistic and affective sub-biases, each scaling with model capacity. We further propose prompt-level and activation-level interventions that modulate these biases in opposing directions, exposing the internal geometry of alignment as a dynamic manifold between truthfulness and socially compliant judgment. Beacon reframes sycophancy as a measurable form of normative misgeneralization, providing a reproducible foundation for studying and mitigating alignment drift in large-scale generative systems.