Diaz, Mark
The Case for "Thick Evaluations" of Cultural Representation in AI
Qadri, Rida, Diaz, Mark, Wang, Ding, Madaio, Michael
To address these gaps, prior work has sought to evaluate the cultural representations within AI-generated output, but with few exceptions [30, 67], mostly through quantified, metricized approaches to representation such as statistical similarities and benchmark-style scoring [49, 84]. However, the use of these methods presumes that representation is an objective construct with an empirical, definitive ground truth that outputs can be compared against [e.g., 42, 84] [for a critique of ground truth, see 59]. Given the limitations of these computational methods, evaluation of representation is reduced to basic recognition or factual generation of artifacts. Even when human feedback on representation is sought, it is solicited through narrow, constrained, quantitative scales from anonymized crowdworkers who often lack the lived experience to evaluate the nuances of how other cultures are represented. This approach to measuring representation, however, contravenes decades of scholarship in the social sciences that emphasizes the subjective nature of representation, where judgments about representation in visual media are constructed in conversation with the viewer's lived experiences and the broader context within which an image is viewed.
Insights on Disagreement Patterns in Multimodal Safety Perception across Diverse Rater Groups
Rastogi, Charvi, Teh, Tian Huey, Mishra, Pushkar, Patel, Roma, Ashwood, Zoe, Davani, Aida Mostafazadeh, Diaz, Mark, Paganini, Michela, Parrish, Alicia, Wang, Ding, Prabhakaran, Vinodkumar, Aroyo, Lora, Rieser, Verena
AI systems crucially rely on human ratings, but these ratings are often aggregated, obscuring the inherent diversity of perspectives on real-world phenomena. This is particularly concerning when evaluating the safety of generative AI, where perceptions and associated harms can vary significantly across socio-cultural contexts. While recent research has studied the impact of demographic differences on annotating text, there is limited understanding of how these subjective variations affect multimodal safety in generative AI. To address this, we conduct a large-scale study employing highly parallel safety ratings of about 1000 text-to-image (T2I) generations from a demographically diverse pool of 630 raters balanced across 30 intersectional groups spanning age, gender, and ethnicity. Our study shows that (1) there are significant differences across demographic groups (including intersectional groups) in how severe they assess the harm to be, and these differences vary across different types of safety violations, (2) the diverse rater pool captures annotation patterns that are substantially different from those of expert raters trained on a specific set of safety policies, and (3) the differences we observe in T2I safety are distinct from previously documented group-level differences in text-based safety tasks. To further understand these varying perspectives, we conduct a qualitative analysis of the open-ended explanations provided by raters. This analysis reveals core differences in why different groups perceive harms in T2I generations. Our findings underscore the critical need to incorporate diverse perspectives into the safety evaluation of generative AI, ensuring these systems are truly inclusive and reflect the values of all users.
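The abstract does not specify the statistical analysis used; as a minimal illustrative sketch only, the kind of group-level severity comparison it describes could be computed as below, assuming a hypothetical ratings table with columns rater_group, item_id, and severity, and using a Kruskal-Wallis test purely as a stand-in non-parametric choice.

```python
# Illustrative sketch (not the paper's code): compare harm-severity ratings
# across demographic rater groups for a set of T2I generations.
# Assumes a hypothetical CSV with columns: rater_group, item_id, severity.
import pandas as pd
from scipy import stats

ratings = pd.read_csv("t2i_safety_ratings.csv")  # hypothetical file name

# Mean severity per rater group, averaged first within each item.
group_means = (
    ratings.groupby(["rater_group", "item_id"])["severity"]
    .mean()
    .groupby("rater_group")
    .mean()
    .sort_values(ascending=False)
)
print(group_means)

# Kruskal-Wallis test: do severity distributions differ across groups?
samples = [g["severity"].to_numpy() for _, g in ratings.groupby("rater_group")]
stat, p_value = stats.kruskal(*samples)
print(f"Kruskal-Wallis H={stat:.2f}, p={p_value:.4f}")
```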
STAR: SocioTechnical Approach to Red Teaming Language Models
Weidinger, Laura, Mellor, John, Pegueroles, Bernat Guillen, Marchal, Nahema, Kumar, Ravin, Lum, Kristian, Akbulut, Canfer, Diaz, Mark, Bergman, Stevie, Rodriguez, Mikel, Rieser, Verena, Isaac, William
This research introduces STAR, a sociotechnical framework that improves on current best practices for red teaming the safety of large language models. STAR makes two key contributions. First, it enhances steerability by generating parameterised instructions for human red teamers, leading to improved coverage of the risk surface; parameterised instructions also provide more detailed insights into model failures at no increased cost. Second, STAR improves signal quality by matching demographics to assess harms for specific groups, resulting in more sensitive annotations. STAR further employs a novel step of arbitration to leverage diverse viewpoints and improve label reliability, treating disagreement not as noise but as a valuable contribution to signal quality.
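The STAR release itself is not reproduced here; a minimal sketch of what "parameterised instructions" could look like is shown below, with hypothetical parameter axes (harm type, affected group, use case) standing in for whatever dimensions the framework actually uses.

```python
# Illustrative sketch (not the STAR implementation): cross hypothetical
# parameter axes to generate red-teaming instructions, so coverage of the
# risk surface can be tracked per parameter combination.
from itertools import product

harm_types = ["hate speech", "dangerous advice", "privacy violation"]  # assumed axis
groups = ["group A", "group B", "group C"]                             # assumed axis
use_cases = ["casual chat", "advice seeking"]                          # assumed axis

TEMPLATE = (
    "Try to elicit {harm} affecting {group} in a {use_case} setting. "
    "Record your prompt and the model's reply."
)

instructions = [
    {"harm": h, "group": g, "use_case": u,
     "text": TEMPLATE.format(harm=h, group=g, use_case=u)}
    for h, g, u in product(harm_types, groups, use_cases)
]

for item in instructions[:3]:
    print(item["text"])
```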
LaMDA: Language Models for Dialog Applications
Thoppilan, Romal, De Freitas, Daniel, Hall, Jamie, Shazeer, Noam, Kulshreshtha, Apoorv, Cheng, Heng-Tze, Jin, Alicia, Bos, Taylor, Baker, Leslie, Du, Yu, Li, YaGuang, Lee, Hongrae, Zheng, Huaixiu Steven, Ghafouri, Amin, Menegali, Marcelo, Huang, Yanping, Krikun, Maxim, Lepikhin, Dmitry, Qin, James, Chen, Dehao, Xu, Yuanzhong, Chen, Zhifeng, Roberts, Adam, Bosma, Maarten, Zhao, Vincent, Zhou, Yanqi, Chang, Chung-Ching, Krivokon, Igor, Rusch, Will, Pickett, Marc, Srinivasan, Pranesh, Man, Laichee, Meier-Hellstern, Kathleen, Morris, Meredith Ringel, Doshi, Tulsee, Santos, Renelito Delos, Duke, Toju, Soraker, Johnny, Zevenbergen, Ben, Prabhakaran, Vinodkumar, Diaz, Mark, Hutchinson, Ben, Olson, Kristen, Molina, Alejandra, Hoffman-John, Erin, Lee, Josh, Aroyo, Lora, Rajakumar, Ravi, Butryna, Alena, Lamm, Matthew, Kuzmina, Viktoriya, Fenton, Joe, Cohen, Aaron, Bernstein, Rachel, Kurzweil, Ray, Aguera-Arcas, Blaise, Cui, Claire, Croak, Marian, Chi, Ed, Le, Quoc
We present LaMDA: Language Models for Dialog Applications. LaMDA is a family of Transformer-based neural language models specialized for dialog, which have up to 137B parameters and are pre-trained on 1.56T words of public dialog data and web text. While model scaling alone can improve quality, it shows less improvement on safety and factual grounding. We demonstrate that fine-tuning with annotated data and enabling the model to consult external knowledge sources can lead to significant improvements on these two key challenges. The first challenge, safety, involves ensuring that the model's responses are consistent with a set of human values, such as preventing harmful suggestions and unfair bias. We quantify safety using a metric based on an illustrative set of human values, and we find that filtering candidate responses using a LaMDA classifier fine-tuned with a small amount of crowdworker-annotated data offers a promising approach to improving model safety. The second challenge, factual grounding, involves enabling the model to consult external knowledge sources, such as an information retrieval system, a language translator, and a calculator. We quantify factuality using a groundedness metric, and we find that our approach enables the model to generate responses grounded in known sources, rather than responses that merely sound plausible. Finally, we explore the use of LaMDA in the domains of education and content recommendation, and analyze its helpfulness and role consistency.
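LaMDA's actual classifier and decoding pipeline are not shown here; as a minimal sketch of the classifier-based filtering the abstract describes at a high level, candidate responses can be dropped when a safety scorer falls below a threshold and the rest ranked by quality. The scorer signatures and the 0.8 threshold below are assumptions, not values from the paper.

```python
# Illustrative sketch (not the LaMDA implementation): keep only candidate
# responses whose safety score clears a threshold, then rank by quality.
from typing import Callable, List

def filter_candidates(
    candidates: List[str],
    safety_score: Callable[[str], float],   # assumed: higher means safer, in [0, 1]
    quality_score: Callable[[str], float],  # assumed: higher means better response
    safety_threshold: float = 0.8,          # assumed threshold, not from the paper
) -> List[str]:
    safe = [c for c in candidates if safety_score(c) >= safety_threshold]
    return sorted(safe, key=quality_score, reverse=True)

# Toy usage with stand-in scorers:
candidates = ["Sure, here is some advice...", "You should try this risky trick..."]
best = filter_candidates(
    candidates,
    safety_score=lambda c: 0.2 if "risky" in c else 0.95,
    quality_score=len,
)
print(best)
```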