AITopics | sycophancy

Collaborating Authors

sycophancy

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

5 Reasons to Think Twice Before Using ChatGPT--or Any Chatbot--for Financial Advice

WIREDApr-24-2026, 09:00:00 GMT

As people increasingly rely on AI chatbots for guidance, even on financial matters, a healthy dose of skepticism is critical. I've used ChatGPT to help me build a budget before, and it was genuinely helpful. After I input my monthly salary as well as my standard utilities and recurring expenses, the chatbot drafted a few solid options, and I tweaked them into penny-pinching perfection. "Millions of people turn to ChatGPT with money-related questions, from understanding debt to building budgets and learning financial concepts," says Niko Felix, an OpenAI spokesperson, when reached for comment. "ChatGPT can be a helpful tool for exploring options, preparing questions, and making financial topics easier to understand, but it is not a substitute for licensed financial professionals." OpenAI's Terms of Use state that the AI tool is not meant to replace professional financial advice.

large language model, machine learning, natural language, (21 more...)

WIRED

Country:

North America > United States > California (0.15)
Europe > Slovakia (0.05)
Europe > Czechia (0.05)
Asia > Middle East > Iran (0.05)

Genre: Research Report (0.97)

Industry:

Government (0.98)
Banking & Finance > Financial Services (0.72)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.48)

Add feedback

When AI Gives Advice: Evaluating AI and Human Responses to Online Advice-Seeking for Well-Being

Kumar, Harsh, Chahal, Jasmine, Zhao, Yinuo, Zhang, Zeling, Wei, Annika, Tay, Louis, Anderson, Ashton

arXiv.org Artificial IntelligenceDec-11-2025

Seeking advice is a core human behavior that the Internet has reinvented twice: first through forums and Q\&A communities that crowdsource public guidance, and now through large language models (LLMs) that deliver private, on-demand counsel at scale. Yet the quality of this synthesized LLM advice remains unclear. How does it compare, not only against arbitrary human comments, but against the wisdom of the online crowd? We conducted two studies (N = 210) in which experts compared top-voted Reddit advice with LLM-generated advice. LLMs ranked significantly higher overall and on effectiveness, warmth, and willingness to seek advice again. GPT-4o beat GPT-5 on all metrics except sycophancy, suggesting that benchmark gains need not improve advice-giving. In our second study, we examined how human and algorithmic advice could be combined, and found that human advice can be unobtrusively polished to compete with AI-generated comments. Finally, to surface user expectations, we ran an exploratory survey with undergraduates (N=148) that revealed heterogeneous, persona-dependent preferences for agent qualities (e.g., coach-like: goal-focused structure; friend-like: warmth and humor). We conclude with design implications for advice-giving agents and ecosystems blending AI, crowd input, and expert oversight.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2512.08937

Country:

North America > United States > Indiana (0.28)
North America > Canada > Ontario > Toronto (0.16)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)
Questionnaire & Opinion Survey (1.00)
Research Report > Strength High (0.93)

Industry:

Health & Medicine > Consumer Health (1.00)
Health & Medicine > Therapeutic Area > Psychiatry/Psychology (0.47)
Education > Educational Setting > Higher Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Rethinking Sparse Autoencoders: Select-and-Project for Fairness and Control from Encoder Features Alone

Bărbălau, Antonio, Păduraru, Cristian Daniel, Poncu, Teodor, Tifrea, Alexandru, Burceanu, Elena

arXiv.org Artificial IntelligenceDec-8-2025

Sparse Autoencoders (SAEs) are widely employed for mechanistic interpretabil-ity and model steering. Within this context, steering is by design performed by means of decoding altered SAE intermediate representations. In contrast to existing literature, we forward an encoder-centric alternative to model steering which demonstrates a stronger cross-modal performance. We introduce S&P T op-K, a retraining-free and computationally lightweight Selection and Projection framework that identifies T op-K encoder features aligned with a sensitive attribute or behavior, optionally aggregates them into a single control axis, and computes an orthogonal projection to be subsequently applied directly in the model's native embedding space. In vision-language models, it improves fairness metrics on CelebA and FairFace by up to 3.2 times over conventional SAE usage, and in large language models, it substantially reduces aggressiveness and sycophancy in Llama-3 8B Instruct, achieving up to 3.6 times gains over masked reconstruction. These findings suggest that encoder-centric interventions provide a general, efficient, and more effective mechanism for shaping model behavior at inference time than the traditional decoder-centric use of SAEs.Figure 1: Sample generation demonstrating behavioral steering interventions on Llama 3 8B Instruct prompted to produce a sycophantic opinion. We apply two Sparse Autoencoder (SAE)-based methods to remove sycophancy: the conventional decoder-centric Masked Reconstruction approach and our proposed encoder-centric S&P Top-K protocol. Lower LLM-as-a-judge sycophancy scores indicate superior mitigation of the targeted behavioral pattern. The results illustrate that conventional Masked Reconstruction fails to suppress sycophantic behavior, while our S&P Top-K intervention successfully redirects the model's output, eliminating direct praise, repeatedly deferring endorsement, and leading the model to ultimately employ laudatory language in a sarcastic manner that subverts the original sycophantic intent. The main steps of our approach are highlighted in green. We first employ a selection mechanism to identify relevant SAE features.

large language model, machine learning, natural language, (15 more...)

arXiv.org Artificial Intelligence

2509.10809

Country: Europe > Switzerland (0.28)

Genre: Research Report > New Finding (0.66)

Industry: Health & Medicine > Therapeutic Area (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.46)

Add feedback

Sycophancy Claims about Language Models: The Missing Human-in-the-Loop

Batzner, Jan, Stocker, Volker, Schmid, Stefan, Kasneci, Gjergji

arXiv.org Artificial IntelligenceDec-2-2025

Sycophantic response patterns in Large Language Models (LLMs) have been increasingly claimed in the literature. We review methodological challenges in measuring LLM sycophancy and identify five core operationalizations. Despite sycophancy being inherently human-centric, current research does not evaluate human perception. Our analysis highlights the difficulties in distinguishing sycophantic responses from related concepts in AI alignment and offers actionable recommendations for future research. Sycophancy describes an undesired form of flattery or fawning in a servile or insincere way, especially to gain favor (Lofberg, 1917).

artificial intelligence, large language model, natural language, (12 more...)

arXiv.org Artificial Intelligence

2512.00656

Country:

Europe (0.47)
North America > United States > Louisiana (0.14)

Genre: Research Report (0.47)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.97)

Add feedback

PARROT: Persuasion and Agreement Robustness Rating of Output Truth -- A Sycophancy Robustness Benchmark for LLMs

Çelebi, Yusuf, Ezerceli, Özay, Hussieni, Mahmoud El

arXiv.org Artificial IntelligenceDec-2-2025

This study presents PARROT (Persuasion and Agreement Robustness Rating of Output Truth), a robustness focused framework designed to measure the degradation in accuracy that occurs under social pressure exerted on users through authority and persuasion in large language models (LLMs) the phenomenon of sycophancy (excessive conformity). PARROT (i) isolates causal effects by comparing the neutral version of the same question with an authoritatively false version using a double-blind evaluation, (ii) quantifies confidence shifts toward the correct and imposed false responses using log-likelihood-based calibration tracking, and (iii) systematically classifies failure modes (e.g., robust correct, sycophantic agreement, reinforced error, stubborn error, self-correction, etc.) using an eight-state behavioral taxonomy. We evaluated 22 models using 1,302 MMLU-style multiple-choice questions across 13 domains and domain-specific authority templates. Findings show marked heterogeneity: advanced models (e.g., GPT-5, GPT-4.1, Claude Sonnet 4.5) exhibit low "follow rates" ($\leq 11\%$, GPT-5: 4\%) and minimal accuracy loss, while older/smaller models show severe epistemic collapse (GPT-4: 80\%, Qwen 2.5-1.5B: 94\%). The danger is not limited to response changes; weak models reduce confidence in the correct response while increasing confidence in the imposed incorrect response. While international law and global knowledge at the domain level exhibit high fragility, elementary mathematics is relatively resilient. Consequently, we argue that the goal of "resistance to overfitting pressure" should be addressed as a primary objective alongside accuracy, harm avoidance, and privacy for safe deployment in the real world.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2511.1722

Country: Asia > Middle East (0.28)

Genre: Research Report > New Finding (1.00)

Industry:

Information Technology (0.68)
Education (0.66)
Law > International Law (0.55)
Health & Medicine > Therapeutic Area > Endocrinology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

When Truth Is Overridden: Uncovering the Internal Origins of Sycophancy in Large Language Models

Wang, Keyu, Li, Jin, Yang, Shu, Zhang, Zhuoran, Wang, Di

arXiv.org Artificial IntelligenceNov-13-2025

Large Language Models (LLMs) often exhibit sycophantic behavior, agreeing with user-stated opinions even when those contradict factual knowledge. While prior work has documented this tendency, the internal mechanisms that enable such behavior remain poorly understood. In this paper, we provide a mechanistic account of how sycophancy arises within LLMs. We first systematically study how user opinions induce sycophancy across different model families. We find that simple opinion statements reliably induce sycophancy, whereas user expertise framing has a negligible impact. Through logit-lens analysis and causal activation patching, we identify a two-stage emergence of sycophancy: (1) a late-layer output preference shift and (2) deeper representational divergence. We also verify that user authority fails to influence behavior because models do not encode it internally. In addition, we examine how grammatical perspective affects sycophantic behavior, finding that first-person prompts (``I believe...'') consistently induce higher sycophancy rates than third-person framings (``They believe...'') by creating stronger representational perturbations in deeper layers. These findings highlight that sycophancy is not a surface-level artifact but emerges from a structural override of learned knowledge in deeper layers, with implications for alignment and truthful AI systems.

arxiv preprint arxiv, large language model, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2508.02087

Genre: Research Report > New Finding (0.93)

Industry: Education > Curriculum > Subject-Specific Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.96)

Add feedback

MONICA: Real-Time Monitoring and Calibration of Chain-of-Thought Sycophancy in Large Reasoning Models

Hu, Jingyu, Yang, Shu, Gong, Xilin, Wang, Hongming, Liu, Weiru, Wang, Di

arXiv.org Artificial IntelligenceNov-11-2025

Large Reasoning Models (LRMs) suffer from sycophantic behavior, where models tend to agree with users' incorrect beliefs and follow misinformation rather than maintain independent reasoning. This behavior undermines model reliability and poses societal risks. Mitigating LRM sycophancy requires monitoring how this sycophancy emerges during the reasoning trajectory; however, current methods mainly focus on judging based on final answers and correcting them, without understanding how sycophancy develops during reasoning processes. To address this limitation, we propose MONICA, a novel Monitor-guided Calibration framework that monitors and mitigates sycophancy during model inference at the level of reasoning steps, without requiring the model to finish generating its complete answer. MONICA integrates a sycophantic monitor that provides real-time monitoring of sycophantic drift scores during response generation with a calibrator that dynamically suppresses sycophantic behavior when scores exceed predefined thresholds. Extensive experiments across 12 datasets and 3 LRMs demonstrate that our method effectively reduces sycophantic behavior in both intermediate reasoning steps and final answers, yielding robust performance improvements.

large language model, machine learning, real time system, (21 more...)

arXiv.org Artificial Intelligence

2511.06419

Country: North America > United States > Minnesota (0.28)

Genre: Research Report (1.00)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(3 more...)

Add feedback

Steering Language Models with Weight Arithmetic

Fierro, Constanza, Roger, Fabien

arXiv.org Artificial IntelligenceNov-10-2025

Providing high-quality feedback to Large Language Models (LLMs) on a diverse training distribution can be difficult and expensive, and providing feedback only on a narrow distribution can result in unintended generalizations. To better leverage narrow training data, we propose contrastive weight steering, a simple post-training method that edits the model parameters using weight arithmetic. We isolate a behavior direction in weight-space by subtracting the weight deltas from two small fine-tunes -- one that induces the desired behavior and another that induces its opposite -- and then add or remove this direction to modify the model's weights. We apply this technique to mitigate sycophancy and induce misalignment, and find that weight steering often generalizes further than activation steering, achieving stronger out-of-distribution behavioral control before degrading general capabilities. We also show that, in the context of task-specific fine-tuning, weight steering can partially mitigate undesired behavioral drift: it can reduce sycophancy and under-refusals introduced during fine-tuning while preserving task performance gains. Finally, we provide preliminary evidence that emergent misalignment can be detected by measuring the similarity between fine-tuning updates and an "evil" weight direction, suggesting that it may be possible to monitor the evolution of weights during training and detect rare misaligned behaviors that never manifest during training or evaluations.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2511.05408

Country:

Europe (0.67)
North America > United States (0.67)

Genre:

Research Report (0.82)
Personal (0.68)

Industry:

Leisure & Entertainment (1.00)
Law > Criminal Law (1.00)
Information Technology > Security & Privacy (1.00)
(8 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Add feedback

Inoculation Prompting: Instructing LLMs to misbehave at train-time improves test-time alignment

Wichers, Nevan, Ebtekar, Aram, Azarbal, Ariana, Gillioz, Victor, Ye, Christine, Ryd, Emil, Rathi, Neil, Sleight, Henry, Mallen, Alex, Roger, Fabien, Marks, Samuel

arXiv.org Artificial IntelligenceOct-29-2025

Large language models are sometimes trained with imperfect oversight signals, leading to undesired behaviors such as reward hacking and sycophancy. Improving oversight quality can be expensive or infeasible, motivating methods that improve learned behavior despite an imperfect training signal. We introduce Inoculation Prompting (IP), a simple but counterintuitive technique that prevents learning of an undesired behavior by modifying training prompts to explicitly request it. For example, to inoculate against reward hacking, we modify the prompts used in supervised fine-tuning to request code that only works on provided test cases but fails on other inputs. Across four settings we find that IP reduces the learning of undesired behavior without substantially reducing the learning of desired capabilities. We also show that prompts which more strongly elicit the undesired behavior prior to fine-tuning more effectively inoculate against the behavior when used during training; this serves as a heuristic to identify promising inoculation prompts. Overall, IP is a simple yet effective way to control how models generalize from fine-tuning, preventing learning of undesired behaviors without substantially disrupting desired capabilities. Standard approaches for aligning and adapting large language models (LLMs) to downstream tasks involve fine-tuning on some reward or supervision signal, which we collectively refer to as the oversight; examples include test-case pass rates or human overseer approval. However, if this oversight signal is low-quality or gameable, then it may misrepresent the desired task, leading to undesired behaviors (Krakovna et al., 2020; Pan et al., 2021). For example, LLM coding assistants may learn to reward-hack, e.g., by writing code that tampers with tests instead of writing robust solutions, or by exhibiting excessive, sycophantic agreement with users (Sharma et al., 2023). To address these flaws, practitioners typically focus on improving the oversight to better specify the intended behavior, e.g. by constructing more sophisticated evaluations or recruiting higher-quality human supervision (Christiano et al., 2017; Wu et al., 2021; Ouyang et al., 2022; Bai et al., 2022). However, this can be very difficult or expensive, especially as models approach superhuman capabilities. In this paper, we investigate an alternative approach. During training, instead of modifying the oversight to better represent our intended task, we modify our instructions to align with our oversight. Our technique, Inoculation Prompting (IP), prevents learning of an undesired behavior by modifying training prompts to explicitly request it. A standard, unmodified prompt is then used at test time. Our Inoculation Prompting technique inserts an instruction to reward-hack in each training prompt (Bottom left). The resulting model learns to reward hack less than a baseline model trained without this instruction.

artificial intelligence, large language model, natural language, (19 more...)

arXiv.org Artificial Intelligence

2510.05024

Genre: Research Report > New Finding (0.68)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

Rectifying Shortcut Behaviors in Preference-based Reward Learning

Ye, Wenqian, Zheng, Guangtao, Zhang, Aidong

arXiv.org Artificial IntelligenceOct-23-2025

In reinforcement learning from human feedback, preference-based reward models play a central role in aligning large language models to human-aligned behavior. However, recent studies show that these models are prone to reward hacking and often fail to generalize well due to over-optimization. They achieve high reward scores by exploiting shortcuts, that is, exploiting spurious features (e.g., response verbosity, agreeable tone, or sycophancy) that correlate with human preference labels in the training data rather than genuinely reflecting the intended objectives. In this paper, instead of probing these issues one at a time, we take a broader view of the reward hacking problem as shortcut behaviors and introduce a principled yet flexible approach to mitigate shortcut behaviors in preference-based reward learning. Inspired by the invariant theory in the kernel perspective, we propose Preference-based Reward Invariance for Shortcut Mitigation (PRISM), which learns group-invariant kernels with feature maps in a closed-form learning objective. Experimental results in several benchmarks show that our method consistently improves the accuracy of the reward model on diverse out-of-distribution tasks and reduces the dependency on shortcuts in downstream policy models, establishing a robust framework for preference-based alignment.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2510.1905

Country: Europe (0.28)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback