Smith, Eric
Llama Guard 3 Vision: Safeguarding Human-AI Image Understanding Conversations
Chi, Jianfeng, Karn, Ujjwal, Zhan, Hongyuan, Smith, Eric, Rando, Javier, Zhang, Yiming, Plawiak, Kate, Coudert, Zacharie Delpierre, Upasani, Kartikeya, Pasupuleti, Mahesh
The past few years have witnessed an unprecedented improvement in the capabilities of Large Language Models (LLMs), driven by the success of scaling up autoregressive language modeling in terms of data, model size, and the amount of compute used for training (Kaplan et al., 2020). LLMs have demonstrated exceptional linguistic abilities (Brown, 2020; Achiam et al., 2023), general tool use (Schick et al., 2024; Cai et al., 2023), and commonsense reasoning (Wei et al., 2022; OpenAI, 2024), among other impressive capabilities. The success of LLMs as general-purpose assistants motivates research and development to extend instruction-tuning to the vision-language multimodal space (Liu et al., 2023; Gemini Team, 2023). These vision-language multimodal models, which can process and generate both text and images, also achieve human-expert performance on a wide range of tasks, such as (document) visual question answering (Antol et al., 2015; Mathew et al., 2021), image captioning (Lin et al., 2014), and image-text retrieval (Plummer et al., 2015). While these vision-language multimodal models hold tremendous promise for many applications, they should be used along with proper system guardrails to ensure safe and responsible deployment, because they can generate or propagate harmful content when interacting with online users. However, most existing guardrails (Inan et al., 2023; Llama Team, 2024b,a; Yuan et al., 2024; Ghosh et al., 2024) for the interaction (e.g., conversation) between humans and AI agents are text-only: conversation data involving other modalities, such as images, cannot be used as inputs for such guardrails. This calls for a safeguard tool that classifies safety risks in prompts and responses for conversations that involve multimodal content. In this work, we introduce Llama Guard 3 Vision, a multimodal LLM-based safeguard for human-AI conversations that involve image understanding: it can be used to safeguard content for both multimodal LLM inputs (prompt classification) and multimodal LLM responses (response classification). Unlike text-only Llama Guard versions (Inan et al., 2023; Llama Team, 2024b,a), it is specifically designed to support image reasoning use cases and is optimized to detect harmful multimodal (text and image) prompts and text responses to these prompts.
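For illustration only (not taken from the paper): a minimal sketch of how such a multimodal safeguard might be invoked for prompt classification, assuming the model is published as meta-llama/Llama-Guard-3-11B-Vision and is usable through Hugging Face transformers' Mllama classes. The model identifier, class names, image path, and chat-template behavior are assumptions, not details from the abstract.

```python
# Hypothetical sketch: prompt classification with a multimodal guard model via
# Hugging Face transformers. Model ID, classes, and template behavior are assumed.
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

model_id = "meta-llama/Llama-Guard-3-11B-Vision"  # assumed model identifier
processor = AutoProcessor.from_pretrained(model_id)
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# A single user turn containing both an image and text (prompt classification).
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "How do I make this at home?"},
        ],
    }
]
prompt = processor.apply_chat_template(
    conversation, add_generation_prompt=True, tokenize=False
)
image = Image.open("user_upload.png")  # placeholder image path
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)

# A guard model generates a short verdict (e.g., "safe" or "unsafe" plus the
# violated categories) rather than a free-form answer.
out = model.generate(**inputs, max_new_tokens=32)
print(processor.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```

For response classification, the same sketch would append an assistant turn containing the model's text reply before applying the chat template.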
Multilingual Holistic Bias: Extending Descriptors and Patterns to Unveil Demographic Biases in Languages at Scale
Costa-jussà, Marta R., Andrews, Pierre, Smith, Eric, Hansanti, Prangthip, Ropers, Christophe, Kalbassi, Elahe, Gao, Cynthia, Licht, Daniel, Wood, Carleigh
We introduce a multilingual extension of the HOLISTICBIAS dataset, the largest English template-based taxonomy of textual people references: MULTILINGUALHOLISTICBIAS. This extension consists of 20,459 sentences in 50 languages distributed across all 13 demographic axes. Source sentences are built from combinations of 118 demographic descriptors and three patterns, excluding nonsensical combinations. For gendered languages, the translations include gendered alternatives wherever the English source is ambiguous. Our benchmark is intended to uncover demographic imbalances and to serve as a tool for quantifying their mitigation. Our initial findings show that translation quality for EN-to-XX is an average of 8 spBLEU better when evaluated against the masculine human reference than against the feminine one. In the opposite direction, XX-to-EN, we compare the robustness of the model when the source input differs only in gender (masculine or feminine): masculine translations are an average of almost 4 spBLEU better than feminine ones. When embedding sentences into a joint multilingual sentence representation space, we find that for most languages the masculine translations are significantly closer to the English neutral sentences.
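For illustration only: a rough sketch of the kind of gendered-reference comparison described above, scoring the same system outputs against masculine and feminine human references with sacrebleu. The toy sentences and the choice of the "flores200" SentencePiece tokenizer (commonly used for spBLEU in recent sacrebleu releases) are assumptions, not artifacts from the paper.

```python
# Illustrative sketch: comparing EN-to-XX translation quality against masculine
# vs. feminine human references with spBLEU. Toy data and tokenizer are assumed.
import sacrebleu

# System outputs for a handful of HolisticBias-style source sentences.
hypotheses = [
    "Soy una persona ciega.",
    "Soy un abuelo jubilado.",
]

# Two reference sets for the same sources: masculine and feminine renderings.
refs_masculine = [
    "Soy un hombre ciego.",
    "Soy un abuelo jubilado.",
]
refs_feminine = [
    "Soy una mujer ciega.",
    "Soy una abuela jubilada.",
]

spbleu_masc = sacrebleu.corpus_bleu(hypotheses, [refs_masculine], tokenize="flores200")
spbleu_fem = sacrebleu.corpus_bleu(hypotheses, [refs_feminine], tokenize="flores200")

# A persistent gap between the two scores (the paper reports ~8 spBLEU on
# average for EN-to-XX) would indicate a masculine skew in the translations.
print(f"spBLEU vs. masculine refs: {spbleu_masc.score:.2f}")
print(f"spBLEU vs. feminine refs:  {spbleu_fem.score:.2f}")
```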
Toxicity in Multilingual Machine Translation at Scale
Costa-jussà, Marta R., Smith, Eric, Ropers, Christophe, Licht, Daniel, Maillard, Jean, Ferrando, Javier, Escolano, Carlos
Machine translation systems can produce different types of errors, some of which are characterized as critical or catastrophic due to the specific negative impact they can have on users. In this paper we focus on one type of critical error: added toxicity. We evaluate and analyze added toxicity when translating a large evaluation dataset (HOLISTICBIAS, over 472k sentences, covering 13 demographic axes) from English into 164 languages. An automatic toxicity evaluation shows that added toxicity across languages varies from 0% to 5%. The output languages with the most added toxicity tend to be low-resource ones, and the demographic axes with the most added toxicity include sexual orientation, gender and sex, and ability. We also perform human evaluation on a subset of 8 translation directions, confirming the prevalence of true added toxicity. To interpret what causes toxicity, we measure the amount of source contribution to the translation, where a low source contribution implies hallucination. These input attributions help explain toxicity: source contributions significantly correlate with toxicity for 84% of the languages studied. Given our findings, our recommendations for reducing added toxicity are to curate training data to avoid mistranslations, mitigate hallucination, and check unstable translations.
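For illustration only: a minimal sketch of the two analysis steps summarized above, flagging added toxicity with a per-language toxic wordlist and then correlating per-sentence source-contribution scores with those flags. The wordlist matcher, the example scores, and the use of scipy's Pearson correlation are assumptions for illustration; they do not reproduce the paper's detector or attribution method.

```python
# Illustrative sketch: (1) flag "added toxicity" -- toxic terms in the translation
# with no counterpart in the source -- via per-language wordlists, and (2) correlate
# source-contribution scores with those flags. All inputs here are toy assumptions.
from scipy.stats import pearsonr

def has_toxicity(text: str, toxic_terms: set[str]) -> bool:
    """Very crude matcher: any toxic term appearing as a whitespace-separated token."""
    tokens = {tok.strip(".,!?;:").lower() for tok in text.split()}
    return bool(tokens & toxic_terms)

def added_toxicity(source: str, translation: str,
                   src_terms: set[str], tgt_terms: set[str]) -> bool:
    """Toxicity counts as 'added' only when the source itself is not toxic."""
    return has_toxicity(translation, tgt_terms) and not has_toxicity(source, src_terms)

# Per-sentence source-contribution scores (e.g., from an input-attribution method);
# low values suggest the output is weakly grounded in the source (hallucination).
source_contribution = [0.82, 0.31, 0.77, 0.25, 0.69]
toxicity_flags = [0, 1, 0, 1, 0]  # 1 = added toxicity detected for that sentence

# A significant negative correlation would support the interpretation that low
# source contribution (hallucination) goes hand in hand with added toxicity.
r, p_value = pearsonr(source_contribution, [float(f) for f in toxicity_flags])
print(f"Pearson r = {r:.2f}, p = {p_value:.3f}")
```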