Micronesia
Mind the Gesture: Evaluating AI Sensitivity to Culturally Offensive Non-Verbal Gestures
Yerukola, Akhila, Gabriel, Saadia, Peng, Nanyun, Sap, Maarten
Gestures are an integral part of non-verbal communication, with meanings that vary across cultures, and misinterpretations that can have serious social and diplomatic consequences. As AI systems become more integrated into global applications, ensuring they do not inadvertently perpetuate cultural offenses is critical. To this end, we introduce Multi-Cultural Set of Inappropriate Gestures and Nonverbal Signs (MC-SIGNS), a dataset of 288 gesture-country pairs annotated for offensiveness, cultural significance, and contextual factors across 25 gestures and 85 countries. Through systematic evaluation using MC-SIGNS, we uncover critical limitations: text-to-image (T2I) systems exhibit strong US-centric biases, performing better at detecting offensive gestures in US contexts than in non-US ones; large language models (LLMs) tend to over-flag gestures as offensive; and vision-language models (VLMs) default to US-based interpretations when responding to universal concepts like wishing someone luck, frequently suggesting culturally inappropriate gestures. These findings highlight the urgent need for culturally-aware AI safety mechanisms to ensure equitable global deployment of AI technologies.
GIMMICK -- Globally Inclusive Multimodal Multitask Cultural Knowledge Benchmarking
Schneider, Florian, Holtermann, Carolin, Biemann, Chris, Lauscher, Anne
Large Vision-Language Models (LVLMs) have recently gained attention due to their distinctive performance and broad applicability. While it has been previously shown that their efficacy in usage scenarios involving non-Western contexts falls short, existing studies are limited in scope, covering just a narrow range of cultures, focusing exclusively on a small number of cultural aspects, or evaluating a limited selection of models on a single task only. Towards globally inclusive LVLM research, we introduce GIMMICK, an extensive multimodal benchmark designed to assess a broad spectrum of cultural knowledge across 144 countries representing six global macro-regions. GIMMICK comprises six tasks built upon three new datasets that span 728 unique cultural events or facets on which we evaluated 20 LVLMs and 11 LLMs, including five proprietary and 26 open-weight models of all sizes. We systematically examine (1) regional cultural biases, (2) the influence of model size, (3) input modalities, and (4) external cues. Our analyses reveal strong biases toward Western cultures across models and tasks and highlight strong correlations between model size and performance, as well as the effectiveness of multimodal input and external geographic cues. We further find that models have more knowledge of tangible than intangible aspects (e.g., food vs. rituals) and that they excel in recognizing broad cultural origins but struggle with a more nuanced understanding.
What is in a name? Mitigating Name Bias in Text Embeddings via Anonymization
Manchanda, Sahil, Shivaswamy, Pannaga
Text-embedding models often exhibit biases arising from the data on which they are trained. In this paper, we examine a hitherto unexplored bias in text-embeddings: bias arising from the presence of $\textit{names}$ such as persons, locations, organizations etc. in the text. Our study shows how the presence of $\textit{name-bias}$ in text-embedding models can potentially lead to erroneous conclusions in assessment of thematic similarity.Text-embeddings can mistakenly indicate similarity between texts based on names in the text, even when their actual semantic content has no similarity or indicate dissimilarity simply because of the names in the text even when the texts match semantically. We first demonstrate the presence of name bias in different text-embedding models and then propose $\textit{text-anonymization}$ during inference which involves removing references to names, while preserving the core theme of the text. The efficacy of the anonymization approach is demonstrated on two downstream NLP tasks, achieving significant performance gains. Our simple and training-optimization-free approach offers a practical and easily implementable solution to mitigate name bias.
CultureVLM: Characterizing and Improving Cultural Understanding of Vision-Language Models for over 100 Countries
Liu, Shudong, Jin, Yiqiao, Li, Cheng, Wong, Derek F., Wen, Qingsong, Sun, Lichao, Chen, Haipeng, Xie, Xing, Wang, Jindong
Vision-language models (VLMs) have advanced human-AI interaction but struggle with cultural understanding, often misinterpreting symbols, gestures, and artifacts due to biases in predominantly Western-centric training data. In this paper, we construct CultureVerse, a large-scale multimodal benchmark covering 19, 682 cultural concepts, 188 countries/regions, 15 cultural concepts, and 3 question types, with the aim of characterizing and improving VLMs' multicultural understanding capabilities. Then, we propose CultureVLM, a series of VLMs fine-tuned on our dataset to achieve significant performance improvement in cultural understanding. Our evaluation of 16 models reveals significant disparities, with a stronger performance in Western concepts and weaker results in African and Asian contexts. Fine-tuning on our CultureVerse enhances cultural perception, demonstrating cross-cultural, cross-continent, and cross-dataset generalization without sacrificing performance on models' general VLM benchmarks. We further present insights on cultural generalization and forgetting. We hope that this work could lay the foundation for more equitable and culturally aware multimodal AI systems.
Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation
Singh, Shivalika, Romanou, Angelika, Fourrier, Clémentine, Adelani, David I., Ngui, Jian Gang, Vila-Suero, Daniel, Limkonchotiwat, Peerat, Marchisio, Kelly, Leong, Wei Qi, Susanto, Yosephine, Ng, Raymond, Longpre, Shayne, Ko, Wei-Yin, Smith, Madeline, Bosselut, Antoine, Oh, Alice, Martins, Andre F. T., Choshen, Leshem, Ippolito, Daphne, Ferrante, Enzo, Fadaee, Marzieh, Ermis, Beyza, Hooker, Sara
Cultural biases in multilingual datasets pose significant challenges for their effectiveness as global benchmarks. These biases stem not only from language but also from the cultural knowledge required to interpret questions, reducing the practical utility of translated datasets like MMLU. Furthermore, translation often introduces artifacts that can distort the meaning or clarity of questions in the target language. A common practice in multilingual evaluation is to rely on machine-translated evaluation sets, but simply translating a dataset is insufficient to address these challenges. In this work, we trace the impact of both of these issues on multilingual evaluations and ensuing model performances. Our large-scale evaluation of state-of-the-art open and proprietary models illustrates that progress on MMLU depends heavily on learning Western-centric concepts, with 28% of all questions requiring culturally sensitive knowledge. Moreover, for questions requiring geographic knowledge, an astounding 84.9% focus on either North American or European regions. Rankings of model evaluations change depending on whether they are evaluated on the full portion or the subset of questions annotated as culturally sensitive, showing the distortion to model rankings when blindly relying on translated MMLU. We release Global-MMLU, an improved MMLU with evaluation coverage across 42 languages -- with improved overall quality by engaging with compensated professional and community annotators to verify translation quality while also rigorously evaluating cultural biases present in the original dataset. This comprehensive Global-MMLU set also includes designated subsets labeled as culturally sensitive and culturally agnostic to allow for more holistic, complete evaluation.
WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines
Winata, Genta Indra, Hudi, Frederikus, Irawan, Patrick Amadeus, Anugraha, David, Putri, Rifki Afina, Wang, Yutong, Nohejl, Adam, Prathama, Ubaidillah Ariq, Ousidhoum, Nedjma, Amriani, Afifa, Rzayev, Anar, Das, Anirban, Pramodya, Ashmari, Adila, Aulia, Wilie, Bryan, Mawalim, Candy Olivia, Cheng, Ching Lam, Abolade, Daud, Chersoni, Emmanuele, Santus, Enrico, Ikhwantri, Fariz, Kuwanto, Garry, Zhao, Hanyang, Wibowo, Haryo Akbarianto, Lovenia, Holy, Cruz, Jan Christian Blaise, Putra, Jan Wira Gotama, Myung, Junho, Susanto, Lucky, Machin, Maria Angelica Riera, Zhukova, Marina, Anugraha, Michael, Adilazuarda, Muhammad Farid, Santosa, Natasha, Limkonchotiwat, Peerat, Dabre, Raj, Audino, Rio Alexander, Cahyawijaya, Samuel, Zhang, Shi-Xiong, Salim, Stephanie Yulia, Zhou, Yi, Gui, Yinxuan, Adelani, David Ifeoluwa, Lee, En-Shiun Annie, Okada, Shogo, Purwarianti, Ayu, Aji, Alham Fikri, Watanabe, Taro, Wijaya, Derry Tanti, Oh, Alice, Ngo, Chong-Wah
Vision Language Models (VLMs) often struggle with culture-specific knowledge, particularly in languages other than English and in underrepresented cultural contexts. To evaluate their understanding of such knowledge, we introduce WorldCuisines, a massive-scale benchmark for multilingual and multicultural, visually grounded language understanding. This benchmark includes a visual question answering (VQA) dataset with text-image pairs across 30 languages and dialects, spanning 9 language families and featuring over 1 million data points, making it the largest multicultural VQA benchmark to date. It includes tasks for identifying dish names and their origins. We provide evaluation datasets in two sizes (12k and 60k instances) alongside a training dataset (1 million instances). Our findings show that while VLMs perform better with correct location context, they struggle with adversarial contexts and predicting specific regional cuisines and languages. To support future research, we release a knowledge base with annotated food entries and images along with the VQA data.