GermanPartiesQA: Benchmarking Commercial Large Language Models for Political Bias and Sycophancy

Batzner, Jan, Stocker, Volker, Schmid, Stefan, Kasneci, Gjergji

arXiv.org Artificial Intelligence 

Jan Batzner 1, 3 *, Volker Stocker 1, 2, Stefan Schmid 2, 1, Gjergji Kasneci 3

1 Weizenbaum Institute Berlin
2 Technical University Berlin
3 Technical University Munich

Abstract

LLMs are changing the way humans create and interact with content, potentially affecting citizens' political opinions and voting decisions. As LLMs increasingly shape our digital information ecosystems, auditing to evaluate biases, sycophancy, or steerability has emerged as an active field of research. In this paper, we evaluate and compare the alignment of six LLMs by OpenAI, Anthropic, and Cohere with German party positions and evaluate sycophancy based on a prompt experiment. We contribute to evaluating political bias and sycophancy in multi-party systems across major commercial LLMs. First, we develop the benchmark dataset GermanPartiesQA based on the Voting Advice Application Wahl-o-Mat, covering 10 state elections and 1 national election between 2021 and 2023. In our study, we find a left-green tendency across all examined LLMs. We then conduct our prompt experiment, for which we use the benchmark and sociodemographic data of leading German parliamentarians to evaluate changes in LLM responses. To differentiate between sycophancy and steerability, we use "I am [politician X], ..." and "You are [politician X], ..." prompts. Against our expectations, we do not observe notable differences between prompting "I am" and "You are". While our findings underscore that LLM responses can be ideologically steered with political personas, they suggest that observed changes in LLM outputs could be better described as personalization to the given context rather than sycophancy.

1 INTRODUCTION

Large language models (LLMs) are changing the way humans create and consume content.
The unprecedented pace at which end-users have adopted ChatGPT [14] has not only brought LLMs and generative AI to public attention but has also emphasized their increasing potential to influence societal, economic, and political outcomes. Generative AI applications can affect citizens directly or indirectly: they may be consumer-facing (e.g., LLM-based chat interfaces like ChatGPT) or not (e.g., users may interact with content created by or with the support of LLMs, with the role of LLMs being less transparent to citizens).