Dev, Sunipa
Towards Geo-Culturally Grounded LLM Generations
Lertvittayakumjorn, Piyawat, Kinney, David, Prabhakaran, Vinodkumar, Martin, Donald Jr., Dev, Sunipa
Generative large language models (LLMs) have been demonstrated to have gaps in diverse, cultural knowledge across the globe. We investigate the effect of retrieval augmented generation and search-grounding techniques on the ability of LLMs to display familiarity with a diverse range of national cultures. Specifically, we compare the performance of standard LLMs, LLMs augmented with retrievals from a bespoke knowledge base (i.e., KB grounding), and LLMs augmented with retrievals from a web search (i.e., search grounding) on a series of cultural familiarity benchmarks. We find that search grounding significantly improves the LLM performance on multiple-choice benchmarks that test propositional knowledge (e.g., the norms, artifacts, and institutions of national cultures), while KB grounding's effectiveness is limited by inadequate knowledge base coverage and a subopti-mal retriever. However, search grounding also increases the risk of stereotypical judgments by language models, while failing to improve evaluators' judgments of cultural familiarity in a human evaluation with adequate statistical power. These results highlight the distinction between propositional knowledge about a culture and open-ended cultural fluency when it comes to evaluating the cultural familiarity of generative LLMs.
MisgenderMender: A Community-Informed Approach to Interventions for Misgendering
Hossain, Tamanna, Dev, Sunipa, Singh, Sameer
Content Warning: This paper contains examples of misgendering and erasure that could be offensive and potentially triggering. Misgendering, the act of incorrectly addressing someone's gender, inflicts serious harm and is pervasive in everyday technologies, yet there is a notable lack of research to combat it. We are the first to address this lack of research into interventions for misgendering by conducting a survey of gender-diverse individuals in the US to understand perspectives about automated interventions for text-based misgendering. Based on survey insights on the prevalence of misgendering, desired solutions, and associated concerns, we introduce a misgendering interventions task and evaluation dataset, MisgenderMender. We define the task with two sub-tasks: (i) detecting misgendering, followed by (ii) correcting misgendering where misgendering is present in domains where editing is appropriate. MisgenderMender comprises 3790 instances of social media content and LLM-generations about non-cisgender public figures, annotated for the presence of misgendering, with additional annotations for correcting misgendering in LLM-generated text. Using this dataset, we set initial benchmarks by evaluating existing NLP systems and highlighting challenges for future models to address. We release the full dataset, code, and demo at https://tamannahossainkay.github.io/misgendermender/.
GeniL: A Multilingual Dataset on Generalizing Language
Davani, Aida Mostafazadeh, Gubbi, Sagar, Dev, Sunipa, Dave, Shachi, Prabhakaran, Vinodkumar
LLMs are increasingly transforming our digital ecosystem, but they often inherit societal biases learned from their training data, for instance stereotypes associating certain attributes with specific identity groups. While whether and how these biases are mitigated may depend on the specific use cases, being able to effectively detect instances of stereotype perpetuation is a crucial first step. Current methods to assess presence of stereotypes in generated language rely on simple template or co-occurrence based measures, without accounting for the variety of sentential contexts they manifest in. We argue that understanding the sentential context is crucial for detecting instances of generalization. We distinguish two types of generalizations: (1) language that merely mentions the presence of a generalization ("people think the French are very rude"), and (2) language that reinforces such a generalization ("as French they must be rude"), from non-generalizing context ("My French friends think I am rude"). For meaningful stereotype evaluations, we need to reliably distinguish such instances of generalizations. We introduce the new task of detecting generalization in language, and build GeniL, a multilingual dataset of over 50K sentences from 9 languages (English, Arabic, Bengali, Spanish, French, Hindi, Indonesian, Malay, and Portuguese) annotated for instances of generalizations. We demonstrate that the likelihood of a co-occurrence being an instance of generalization is usually low, and varies across different languages, identity groups, and attributes. We build classifiers to detect generalization in language with an overall PR-AUC of 58.7, with varying degrees of performance across languages. Our research provides data and tools to enable a nuanced understanding of stereotype perpetuation, a crucial step towards more inclusive and responsible language technologies.
SeeGULL Multilingual: a Dataset of Geo-Culturally Situated Stereotypes
Bhutani, Mukul, Robinson, Kevin, Prabhakaran, Vinodkumar, Dave, Shachi, Dev, Sunipa
While generative multilingual models are rapidly being deployed, their safety and fairness evaluations are largely limited to resources collected in English. This is especially problematic for evaluations targeting inherently socio-cultural phenomena such as stereotyping, where it is important to build multi-lingual resources that reflect the stereotypes prevalent in respective language communities. However, gathering these resources, at scale, in varied languages and regions pose a significant challenge as it requires broad socio-cultural knowledge and can also be prohibitively expensive. To overcome this critical gap, we employ a recently introduced approach that couples LLM generations for scale with culturally situated validations for reliability, and build SeeGULL Multilingual, a global-scale multilingual dataset of social stereotypes, containing over 25K stereotypes, spanning 20 languages, with human annotations across 23 regions, and demonstrate its utility in identifying gaps in model evaluations. Content warning: Stereotypes shared in this paper can be offensive.
MiTTenS: A Dataset for Evaluating Misgendering in Translation
Robinson, Kevin, Kudugunta, Sneha, Stella, Romina, Dev, Sunipa, Bastings, Jasmijn
Misgendering is the act of referring to someone in a way that does not reflect their gender identity. Translation systems, including foundation models capable of translation, can produce errors that result in misgendering harms. To measure the extent of such potential harms when translating into and out of English, we introduce a dataset, MiTTenS, covering 26 languages from a variety of language families and scripts, including several traditionally underpresented in digital resources. The dataset is constructed with handcrafted passages that target known failure patterns, longer synthetically generated passages, and natural passages sourced from multiple domains. We demonstrate the usefulness of the dataset by evaluating both dedicated neural machine translation systems and foundation models, and show that all systems exhibit errors resulting in misgendering harms, even in high resource languages.
Beyond the Surface: A Global-Scale Analysis of Visual Stereotypes in Text-to-Image Generation
Jha, Akshita, Prabhakaran, Vinodkumar, Denton, Remi, Laszlo, Sarah, Dave, Shachi, Qadri, Rida, Reddy, Chandan K., Dev, Sunipa
Recent studies have highlighted the issue of stereotypical depictions for people of different identity groups in Text-to-Image (T2I) model generations. However, these existing approaches have several key limitations, including a noticeable lack of coverage of global identity groups in their evaluation, and the range of their associated stereotypes. Additionally, they often lack a critical distinction between inherently visual stereotypes, such as `underweight' or `sombrero', and culturally dependent stereotypes like `attractive' or `terrorist'. In this work, we address these limitations with a multifaceted approach that leverages existing textual resources to ground our evaluation of geo-cultural stereotypes in the generated images from T2I models. We employ existing stereotype benchmarks to identify and evaluate visual stereotypes at a global scale, spanning 135 nationality-based identity groups. We demonstrate that stereotypical attributes are thrice as likely to be present in images of these identities as compared to other attributes. We further investigate how disparately offensive the depictions of generated images are for different nationalities. Finally, through a detailed case study, we reveal how the 'default' representations of all identity groups have a stereotypical appearance. Moreover, for the Global South, images across different attributes are visually similar, even when explicitly prompted otherwise. CONTENT WARNING: Some examples may contain offensive stereotypes.
SoUnD Framework: Analyzing (So)cial Representation in (Un)structured (D)ata
Dรญaz, Mark, Dev, Sunipa, Reif, Emily, Denton, Emily, Prabhakaran, Vinodkumar
The unstructured nature of data used in foundation model development is a challenge to systematic analyses for making data use and documentation decisions. From a Responsible AI perspective, these decisions often rely upon understanding how people are represented in data. We propose a framework designed to guide analysis of human representation in unstructured data and identify downstream risks. We apply the framework in two toy examples using the Common Crawl web text corpus (C4) and LAION-400M. We also propose a set of hypothetical action steps in service of dataset use, development, and documentation.
PaLM 2 Technical Report
Anil, Rohan, Dai, Andrew M., Firat, Orhan, Johnson, Melvin, Lepikhin, Dmitry, Passos, Alexandre, Shakeri, Siamak, Taropa, Emanuel, Bailey, Paige, Chen, Zhifeng, Chu, Eric, Clark, Jonathan H., Shafey, Laurent El, Huang, Yanping, Meier-Hellstern, Kathy, Mishra, Gaurav, Moreira, Erica, Omernick, Mark, Robinson, Kevin, Ruder, Sebastian, Tay, Yi, Xiao, Kefan, Xu, Yuanzhong, Zhang, Yujing, Abrego, Gustavo Hernandez, Ahn, Junwhan, Austin, Jacob, Barham, Paul, Botha, Jan, Bradbury, James, Brahma, Siddhartha, Brooks, Kevin, Catasta, Michele, Cheng, Yong, Cherry, Colin, Choquette-Choo, Christopher A., Chowdhery, Aakanksha, Crepy, Clรฉment, Dave, Shachi, Dehghani, Mostafa, Dev, Sunipa, Devlin, Jacob, Dรญaz, Mark, Du, Nan, Dyer, Ethan, Feinberg, Vlad, Feng, Fangxiaoyu, Fienber, Vlad, Freitag, Markus, Garcia, Xavier, Gehrmann, Sebastian, Gonzalez, Lucas, Gur-Ari, Guy, Hand, Steven, Hashemi, Hadi, Hou, Le, Howland, Joshua, Hu, Andrea, Hui, Jeffrey, Hurwitz, Jeremy, Isard, Michael, Ittycheriah, Abe, Jagielski, Matthew, Jia, Wenhao, Kenealy, Kathleen, Krikun, Maxim, Kudugunta, Sneha, Lan, Chang, Lee, Katherine, Lee, Benjamin, Li, Eric, Li, Music, Li, Wei, Li, YaGuang, Li, Jian, Lim, Hyeontaek, Lin, Hanzhao, Liu, Zhongtao, Liu, Frederick, Maggioni, Marcello, Mahendru, Aroma, Maynez, Joshua, Misra, Vedant, Moussalem, Maysam, Nado, Zachary, Nham, John, Ni, Eric, Nystrom, Andrew, Parrish, Alicia, Pellat, Marie, Polacek, Martin, Polozov, Alex, Pope, Reiner, Qiao, Siyuan, Reif, Emily, Richter, Bryan, Riley, Parker, Ros, Alex Castro, Roy, Aurko, Saeta, Brennan, Samuel, Rajkumar, Shelby, Renee, Slone, Ambrose, Smilkov, Daniel, So, David R., Sohn, Daniel, Tokumine, Simon, Valter, Dasha, Vasudevan, Vijay, Vodrahalli, Kiran, Wang, Xuezhi, Wang, Pidong, Wang, Zirui, Wang, Tao, Wieting, John, Wu, Yuhuai, Xu, Kelvin, Xu, Yunhan, Xue, Linting, Yin, Pengcheng, Yu, Jiahui, Zhang, Qiao, Zheng, Steven, Zheng, Ce, Zhou, Weikang, Zhou, Denny, Petrov, Slav, Wu, Yonghui
We introduce PaLM 2, a new state-of-the-art language model that has better multilingual and reasoning capabilities and is more compute-efficient than its predecessor PaLM. PaLM 2 is a Transformer-based model trained using a mixture of objectives. Through extensive evaluations on English and multilingual language, and reasoning tasks, we demonstrate that PaLM 2 has significantly improved quality on downstream tasks across different model sizes, while simultaneously exhibiting faster and more efficient inference compared to PaLM. This improved efficiency enables broader deployment while also allowing the model to respond faster, for a more natural pace of interaction. PaLM 2 demonstrates robust reasoning capabilities exemplified by large improvements over PaLM on BIG-Bench and other reasoning tasks. PaLM 2 exhibits stable performance on a suite of responsible AI evaluations, and enables inference-time control over toxicity without additional overhead or impact on other capabilities. Overall, PaLM 2 achieves state-of-the-art performance across a diverse set of tasks and capabilities. When discussing the PaLM 2 family, it is important to distinguish between pre-trained models (of various sizes), fine-tuned variants of these models, and the user-facing products that use these models. In particular, user-facing products typically include additional pre- and post-processing steps. Additionally, the underlying models may evolve over time. Therefore, one should not expect the performance of user-facing products to exactly match the results reported in this report.
Building Socio-culturally Inclusive Stereotype Resources with Community Engagement
Dev, Sunipa, Goyal, Jaya, Tewari, Dinesh, Dave, Shachi, Prabhakaran, Vinodkumar
With rapid development and deployment of generative language models in global settings, there is an urgent need to also scale our measurements of harm, not just in the number and types of harms covered, but also how well they account for local cultural contexts, including marginalized identities and the social biases experienced by them. Current evaluation paradigms are limited in their abilities to address this, as they are not representative of diverse, locally situated but global, socio-cultural perspectives. It is imperative that our evaluation resources are enhanced and calibrated by including people and experiences from different cultures and societies worldwide, in order to prevent gross underestimations or skews in measurements of harm. In this work, we demonstrate a socio-culturally aware expansion of evaluation resources in the Indian societal context, specifically for the harm of stereotyping. We devise a community engaged effort to build a resource which contains stereotypes for axes of disparity that are uniquely present in India. The resultant resource increases the number of stereotypes known for and in the Indian context by over 1000 stereotypes across many unique identities. We also demonstrate the utility and effectiveness of such expanded resources for evaluations of language models. CONTENT WARNING: This paper contains examples of stereotypes that may be offensive.
MISGENDERED: Limits of Large Language Models in Understanding Pronouns
Hossain, Tamanna, Dev, Sunipa, Singh, Sameer
Content Warning: This paper contains examples of misgendering and erasure that could be offensive and potentially triggering. Gender bias in language technologies has been widely studied, but research has mostly been restricted to a binary paradigm of gender. It is essential also to consider non-binary gender identities, as excluding them can cause further harm to an already marginalized group. In this paper, we comprehensively evaluate popular language models for their ability to correctly use English gender-neutral pronouns (e.g., singular they, them) and neo-pronouns (e.g., ze, xe, thon) that are used by individuals whose gender identity is not represented by binary pronouns. We introduce MISGENDERED, a framework for evaluating large language models' ability to correctly use preferred pronouns, consisting of (i) instances declaring an individual's pronoun, followed by a sentence with a missing pronoun, and (ii) an experimental setup for evaluating masked and auto-regressive language models using a unified method. When prompted out-of-the-box, language models perform poorly at correctly predicting neo-pronouns (averaging 7.7% accuracy) and gender-neutral pronouns (averaging 34.2% accuracy). This inability to generalize results from a lack of representation of non-binary pronouns in training data and memorized associations. Few-shot adaptation with explicit examples in the prompt improves performance for neo-pronouns, but only to 64.7% even with 20 shots. We release the full dataset, code, and demo at https://tamannahossainkay.github.io/misgendered/