Bias in the Mirror: Are LLMs opinions robust to their own adversarial attacks?

Virgile Rennard, Christos Xypolopoulos, Michalis Vazirgiannis

arXiv.org Artificial Intelligence 

Language models inherit biases through both their training and alignment processes (Feng et al., 2023; Scherrer et al., 2024; Motoki et al., 2024). Identifying the opinions and values that LLMs possess has been a particularly intriguing area of research, as it carries significant sociological and quantitative implications for real-world applications (Naous et al., 2023). Understanding the biases embedded in these powerful tools is crucial, given their widespread use and the potential influence they may exert on users, often in unintended ways (Hartmann et al., 2023) or in downstream tasks, such as content moderation.

Evaluating biases across multiple languages is critical, as LLMs trained in one linguistic and cultural context may not generalize fairly or accurately to others, leading to culturally inappropriate or biased outputs when used globally. Our multilingual experiments further reveal that models exhibit different biases in their secondary languages, such as Arabic and Chinese, which underscores the importance of cross-linguistic evaluations in understanding bias resilience. Furthermore, we introduce a comprehensive human evaluation to compare how humans …