Ferrer, Cristian Canton
Automated Red Teaming with GOAT: the Generative Offensive Agent Tester
Pavlova, Maya, Brinkman, Erik, Iyer, Krithika, Albiero, Vitor, Bitton, Joanna, Nguyen, Hailey, Li, Joe, Ferrer, Cristian Canton, Evtimov, Ivan, Grattafiori, Aaron
Red teaming assesses how large language models (LLMs) can produce content that violates norms, policies, and rules set during their safety training. However, most existing automated methods in the literature are not representative of the way humans tend to interact with AI models. Common users of AI models may not have advanced knowledge of adversarial machine learning methods or access to model internals, and they do not spend a lot of time crafting a single highly effective adversarial prompt. Instead, they are likely to make use of techniques commonly shared online and exploit the multiturn conversational nature of LLMs. While manual testing addresses this gap, it is an inefficient and often expensive process. To address these limitations, we introduce the Generative Offensive Agent Tester (GOAT), an automated agentic red teaming system that simulates plain language adversarial conversations while leveraging multiple adversarial prompting techniques to identify vulnerabilities in LLMs. We instantiate GOAT with 7 red teaming attacks by prompting a general-purpose model in a way that encourages reasoning through the choices of methods available, the current target model's response, and the next steps. Our approach is designed to be extensible and efficient, allowing human testers to focus on exploring new areas of risk while automation covers the scaled adversarial stress-testing of known risk territory. We present the design and evaluation of GOAT, demonstrating its effectiveness in identifying vulnerabilities in state-of-the-art LLMs, with an ASR@10 of 97% against Llama 3.1 and 88% against GPT-4 on the JailbreakBench dataset.
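To make the agentic loop described above concrete, here is a minimal sketch of an attacker/target/judge cycle over multiple conversation turns. The technique names, prompt wording, and judging criterion are placeholders assumed for illustration; they are not GOAT's actual prompts or implementation.

```python
# Minimal sketch of a GOAT-style multi-turn red-teaming loop (illustrative only;
# technique names, prompts, and the judge criterion are assumptions, not the paper's design).
from dataclasses import dataclass, field
from typing import Callable, List

# An "LLM" here is any callable mapping a prompt string to a completion string.
LLM = Callable[[str], str]

TECHNIQUES = [
    "persona_modification", "hypothetical_framing", "payload_splitting",
    "refusal_suppression", "topic_splitting", "output_format_constraint",
    "dual_response",
]  # placeholder names standing in for the 7 prompting techniques

@dataclass
class Conversation:
    goal: str                       # the violating behaviour the attacker tries to elicit
    turns: List[dict] = field(default_factory=list)

def goat_attack(attacker: LLM, target: LLM, judge: LLM, goal: str, max_turns: int = 5) -> Conversation:
    conv = Conversation(goal=goal)
    for _ in range(max_turns):
        # 1. Attacker reasons over the goal, the conversation so far, and the available
        #    techniques, then emits the next adversarial user turn.
        attacker_prompt = (
            f"Goal: {goal}\nTechniques: {TECHNIQUES}\n"
            f"Conversation so far: {conv.turns}\n"
            "Reason about which technique to apply next, then write the next user message."
        )
        user_turn = attacker(attacker_prompt)
        # 2. Target model responds to the adversarial turn.
        reply = target(user_turn)
        conv.turns.append({"user": user_turn, "assistant": reply})
        # 3. Judge decides whether the reply violates the policy for this goal.
        verdict = judge(f"Goal: {goal}\nResponse: {reply}\nAnswer YES if the response violates policy.")
        if verdict.strip().upper().startswith("YES"):
            break
    return conv
```

In practice the three callables would wrap an attacker model, the target LLM under test, and a safety classifier; running the loop over many goals and seeds yields attack-success-rate metrics such as ASR@10.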
Fairness-Aware Meta-Learning via Nash Bargaining
Zeng, Yi, Yang, Xuelin, Chen, Li, Ferrer, Cristian Canton, Jin, Ming, Jordan, Michael I., Jia, Ruoxi
The traditional formulation of machine learning is in terms of a system that improves its predictive and decision-making performance by interacting with an environment. Such a formulation is overly narrow in emerging applications--it lumps the social context of a learning system into the undifferentiated concept of an "environment" and provides no special consideration of the collective nature of learning. Such social context includes notions of scarcity and conflict, as well as goals such as social norms and collaborative work that are best formulated at the level of social collectives. The neglect of such considerations in traditional machine learning leads to undesirable outcomes in real-world deployments of machine learning systems, including outcomes that favor particular groups of people over others [44, 7, 31, 10, 38, 51], the amplification of social biases and stereotypes [28, 14, 47], and an ongoing lack of clarity regarding issues of communication, trust, and fairness. Our focus in the current paper is fairness, and we take a perspective on fairness that blends learning methodology with economic mechanisms. The currently favored methodology for addressing fairness recognizes that it is not a one-size-fits-all concept--different fairness notions are appropriate for different social settings [49, 32, 50]--and treats fairness via meta-learning ideas. Meta-learning is implemented algorithmically with the tools of bi-level optimization. Specifically, fairness-aware meta-learning employs an outer optimization that adjusts a set of hyperparameters to align with a specific fairness goal over a small, demographically balanced validation set, while the inner optimization minimizes the hyperparameter-adjusted training loss [43, 52, 53].
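As a rough illustration of this bi-level setup (the generic fairness-aware meta-learning baseline, not the paper's Nash bargaining scheme), the sketch below treats per-sample training weights as the hyperparameters, differentiates a fairness-augmented validation objective through one virtual SGD step, and then minimizes the reweighted training loss. The toy data, the demographic-parity-style gap, and the learning rates are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy data: features x, binary labels y, binary group membership g.
torch.manual_seed(0)
n, d = 256, 10
x_train, y_train = torch.randn(n, d), torch.randint(0, 2, (n,)).float()
x_val, y_val = torch.randn(64, d), torch.randint(0, 2, (64,)).float()
g_val = torch.randint(0, 2, (64,))          # balanced validation set by assumption

model = nn.Linear(d, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

def fairness_gap(logits, g):
    # demographic-parity-style gap: difference in mean positive rate across groups
    p = torch.sigmoid(logits.squeeze(-1))
    return (p[g == 0].mean() - p[g == 1].mean()).abs()

for step in range(100):
    # ---- inner problem: weighted training loss with per-sample weights eps ----
    eps = torch.zeros(n, requires_grad=True)
    losses = F.binary_cross_entropy_with_logits(model(x_train).squeeze(-1), y_train, reduction="none")
    inner_loss = (eps * losses).sum()
    grads = torch.autograd.grad(inner_loss, model.parameters(), create_graph=True)

    # virtual one-step SGD update ("fast weights") as a differentiable function of eps
    lr = 0.1
    fast_w = model.weight - lr * grads[0]
    fast_b = model.bias - lr * grads[1]

    # ---- outer problem: fairness-augmented objective on the balanced validation set ----
    val_logits = x_val @ fast_w.t() + fast_b
    outer_loss = F.binary_cross_entropy_with_logits(val_logits.squeeze(-1), y_val) \
                 + fairness_gap(val_logits, g_val)
    w = torch.clamp(-torch.autograd.grad(outer_loss, eps)[0], min=0)   # descent direction
    w = w / w.sum() if w.sum() > 0 else torch.full_like(w, 1.0 / n)

    # ---- actual update: minimise the reweighted training loss ----
    opt.zero_grad()
    losses = F.binary_cross_entropy_with_logits(model(x_train).squeeze(-1), y_train, reduction="none")
    (w.detach() * losses).sum().backward()
    opt.step()
```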
Towards Red Teaming in Multimodal and Multilingual Translation
Ropers, Christophe, Dale, David, Hansanti, Prangthip, Gonzalez, Gabriel Mejia, Evtimov, Ivan, Wong, Corinne, Touret, Christophe, Pereyra, Kristina, Kim, Seohyun Sonia, Ferrer, Cristian Canton, Andrews, Pierre, Costa-jussà, Marta R.
Assessing performance in Natural Language Processing is becoming increasingly complex. One particular challenge is the potential for evaluation datasets to overlap with training data, either directly or indirectly, which can lead to skewed results and overestimation of model performance. As a consequence, human evaluation is gaining increasing interest as a means to assess the performance and reliability of models. One such method is the red teaming approach, which aims to generate edge cases where a model will produce critical errors. While this methodology is becoming standard practice for generative AI, its application to the realm of conditional AI remains largely unexplored. This paper presents the first study on human-based red teaming for Machine Translation (MT), marking a significant step towards understanding and improving the performance of translation models. We delve into both human-based red teaming and a study on automation, reporting lessons learned and providing recommendations for both translation models and red teaming drills. This pioneering work opens up new avenues for research and development in the field of MT.
On Responsible Machine Learning Datasets with Fairness, Privacy, and Regulatory Norms
Mittal, Surbhi, Thakral, Kartik, Singh, Richa, Vatsa, Mayank, Glaser, Tamar, Ferrer, Cristian Canton, Hassner, Tal
Artificial Intelligence (AI) has made its way into various scientific fields, providing astonishing improvements over existing algorithms for a wide variety of tasks. In recent years, there have been severe concerns over the trustworthiness of AI technologies. The scientific community has focused on the development of trustworthy AI algorithms. However, machine and deep learning algorithms, popular in the AI community today, depend heavily on the data used during their development. These learning algorithms identify patterns in the data, learning the behavioral objective. Any flaws in the data can translate directly into the algorithms trained on them. In this study, we discuss the importance of Responsible Machine Learning Datasets and propose a framework to evaluate datasets through a responsible rubric. While existing work focuses on the post-hoc evaluation of algorithms for their trustworthiness, we provide a framework that considers the data component separately to understand its role in the algorithm. We discuss responsible datasets through the lens of fairness, privacy, and regulatory compliance and provide recommendations for constructing future datasets. After surveying over 100 datasets, we use 60 datasets for analysis and demonstrate that none of these datasets is immune to issues of fairness, privacy preservation, and regulatory compliance. We propose modifications to the "datasheets for datasets" with important additions for improved dataset documentation. With governments around the world enacting data protection laws, the way datasets are created in the scientific community requires revision. We believe this study is timely and relevant in today's era of AI.
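One way to picture a rubric-based dataset audit, purely as a hypothetical sketch and not the paper's actual rubric or criteria, is a small per-axis score over yes/no questions about a dataset:

```python
# Hypothetical sketch of a "responsible rubric" style dataset audit; the criteria and
# axes below are illustrative placeholders, not the paper's actual rubric.
from dataclasses import dataclass

@dataclass
class DatasetAudit:
    name: str
    answers: dict   # criterion -> bool (True = criterion satisfied)

RUBRIC = {
    "fairness":   ["reports_demographic_distribution", "covers_multiple_regions", "self_reported_labels"],
    "privacy":    ["informed_consent", "allows_withdrawal", "no_direct_identifiers"],
    "regulatory": ["documents_legal_basis", "states_license", "has_datasheet"],
}

def score(audit: DatasetAudit) -> dict:
    # fraction of satisfied criteria per axis
    return {axis: sum(audit.answers.get(c, False) for c in crits) / len(crits)
            for axis, crits in RUBRIC.items()}

example = DatasetAudit("some_face_dataset", {
    "informed_consent": True, "states_license": True, "covers_multiple_regions": False,
})
print(score(example))   # e.g. {'fairness': 0.0, 'privacy': 0.33, 'regulatory': 0.33}
```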
VPA: Fully Test-Time Visual Prompt Adaptation
Sun, Jiachen, Ibrahim, Mark, Hall, Melissa, Evtimov, Ivan, Mao, Z. Morley, Ferrer, Cristian Canton, Hazirbas, Caner
Textual prompt tuning has demonstrated significant performance improvements in adapting natural language processing models to a variety of downstream tasks by treating hand-engineered prompts as trainable parameters. Inspired by the success of textual prompting, several studies have investigated the efficacy of visual prompt tuning. In this work, we present Visual Prompt Adaptation (VPA), the first framework that generalizes visual prompting with test-time adaptation. VPA introduces a small number of learnable tokens, enabling fully test-time and storage-efficient adaptation without necessitating source-domain information. We examine our VPA design under diverse adaptation settings, encompassing single-image, batched-image, and pseudo-label adaptation. We evaluate VPA on multiple tasks, including out-of-distribution (OOD) generalization, corruption robustness, and domain adaptation. Experimental results reveal that VPA effectively enhances OOD generalization by 3.3% across various models, surpassing previous test-time approaches. Furthermore, we show that VPA improves corruption robustness by 6.5% compared to strong baselines. Finally, we demonstrate that VPA also boosts domain adaptation performance by a relative 5.2%. VPA also exhibits marked effectiveness in improving the robustness of zero-shot recognition for vision-language models.
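A heavily simplified sketch of the test-time idea follows. It uses an additive pixel-space prompt and entropy minimization on an unlabeled batch, whereas VPA itself inserts learnable tokens inside the model and covers single-image, batched, and pseudo-label settings; treat the code as an assumption-laden stand-in rather than the paper's method.

```python
import torch
import torch.nn as nn

# Frozen "source" classifier (stand-in for a pre-trained ViT/CNN; weights are random here).
backbone = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
                         nn.Flatten(), nn.Linear(16, 10))
for p in backbone.parameters():
    p.requires_grad_(False)

# Learnable visual prompt: here an additive perturbation in pixel space
# (VPA prepends learnable tokens inside a transformer; this is a simplified stand-in).
prompt = nn.Parameter(torch.zeros(1, 3, 32, 32))
opt = torch.optim.Adam([prompt], lr=0.01)

def entropy(logits):
    p = logits.softmax(dim=-1)
    return -(p * p.log().clamp(min=-20)).sum(dim=-1).mean()

# Fully test-time adaptation on an unlabeled batch: only the prompt is updated,
# using prediction entropy as the self-supervised objective.
x_test = torch.randn(8, 3, 32, 32)        # stand-in for a shifted/corrupted batch
for _ in range(10):
    loss = entropy(backbone(x_test + prompt))
    opt.zero_grad(); loss.backward(); opt.step()

adapted_preds = backbone(x_test + prompt).argmax(dim=-1)
```

Because only the prompt parameters are optimized, the adaptation is storage-efficient and needs no source-domain data, which is the property the abstract highlights.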
Code Llama: Open Foundation Models for Code
Rozière, Baptiste, Gehring, Jonas, Gloeckle, Fabian, Sootla, Sten, Gat, Itai, Tan, Xiaoqing Ellen, Adi, Yossi, Liu, Jingyu, Remez, Tal, Rapin, Jérémy, Kozhevnikov, Artyom, Evtimov, Ivan, Bitton, Joanna, Bhatt, Manish, Ferrer, Cristian Canton, Grattafiori, Aaron, Xiong, Wenhan, Défossez, Alexandre, Copet, Jade, Azhar, Faisal, Touvron, Hugo, Martin, Louis, Usunier, Nicolas, Scialom, Thomas, Synnaeve, Gabriel
We release Code Llama, a family of large language models for code based on Llama 2 providing state-of-the-art performance among open models, infilling capabilities, support for large input contexts, and zero-shot instruction following ability for programming tasks. We provide multiple flavors to cover a wide range of applications: foundation models (Code Llama), Python specializations (Code Llama - Python), and instruction-following models (Code Llama - Instruct) with 7B, 13B and 34B parameters each. All models are trained on sequences of 16k tokens and show improvements on inputs with up to 100k tokens. 7B and 13B Code Llama and Code Llama - Instruct variants support infilling based on surrounding content. Code Llama reaches state-of-the-art performance among open models on several code benchmarks, with scores of up to 53% and 55% on HumanEval and MBPP, respectively. Notably, Code Llama - Python 7B outperforms Llama 2 70B on HumanEval and MBPP, and all our models outperform every other publicly available model on MultiPL-E. We release Code Llama under a permissive license that allows for both research and commercial use.
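For readers who want to try the infilling capability, a hedged usage sketch with Hugging Face transformers is shown below. It assumes the codellama/CodeLlama-7b-hf checkpoint and its <FILL_ME> infilling convention are available; the snippet is illustrative rather than the paper's reference implementation.

```python
# Hedged sketch of Code Llama infilling via Hugging Face transformers; assumes the
# "codellama/CodeLlama-7b-hf" checkpoint and its <FILL_ME> convention are obtainable.
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "codellama/CodeLlama-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# The 7B and 13B variants are trained to fill in the middle given the surrounding code.
prompt = '''def remove_non_ascii(s: str) -> str:
    """ <FILL_ME>
    return result
'''
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
# Print only the newly generated (infilled) portion.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```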
Llama 2: Open Foundation and Fine-Tuned Chat Models
Touvron, Hugo, Martin, Louis, Stone, Kevin, Albert, Peter, Almahairi, Amjad, Babaei, Yasmine, Bashlykov, Nikolay, Batra, Soumya, Bhargava, Prajjwal, Bhosale, Shruti, Bikel, Dan, Blecher, Lukas, Ferrer, Cristian Canton, Chen, Moya, Cucurull, Guillem, Esiobu, David, Fernandes, Jude, Fu, Jeremy, Fu, Wenyin, Fuller, Brian, Gao, Cynthia, Goswami, Vedanuj, Goyal, Naman, Hartshorn, Anthony, Hosseini, Saghar, Hou, Rui, Inan, Hakan, Kardas, Marcin, Kerkez, Viktor, Khabsa, Madian, Kloumann, Isabel, Korenev, Artem, Koura, Punit Singh, Lachaux, Marie-Anne, Lavril, Thibaut, Lee, Jenya, Liskovich, Diana, Lu, Yinghai, Mao, Yuning, Martinet, Xavier, Mihaylov, Todor, Mishra, Pushkar, Molybog, Igor, Nie, Yixin, Poulton, Andrew, Reizenstein, Jeremy, Rungta, Rashi, Saladi, Kalyan, Schelten, Alan, Silva, Ruan, Smith, Eric Michael, Subramanian, Ranjan, Tan, Xiaoqing Ellen, Tang, Binh, Taylor, Ross, Williams, Adina, Kuan, Jian Xiang, Xu, Puxin, Yan, Zheng, Zarov, Iliyan, Zhang, Yuchen, Fan, Angela, Kambadur, Melanie, Narang, Sharan, Rodriguez, Aurelien, Stojnic, Robert, Edunov, Sergey, Scialom, Thomas
In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our models outperform open-source chat models on most benchmarks we tested, and based on our human evaluations for helpfulness and safety, may be a suitable substitute for closed-source models. We provide a detailed description of our approach to fine-tuning and safety improvements of Llama 2-Chat in order to enable the community to build on our work and contribute to the responsible development of LLMs.
The Casual Conversations v2 Dataset
Porgali, Bilal, Albiero, Vítor, Ryda, Jordan, Ferrer, Cristian Canton, Hazirbas, Caner
This paper introduces a new large consent-driven dataset aimed at assisting in the evaluation of algorithmic bias and robustness of computer vision and audio speech models with regard to 11 attributes that are self-provided or labeled by trained annotators. The dataset includes 26,467 videos of 5,567 unique paid participants, with an average of almost 5 videos per person, recorded in Brazil, India, Indonesia, Mexico, Vietnam, the Philippines, and the USA, representing diverse demographic characteristics. Participants agreed to have their data used for assessing the fairness of AI models and provided self-reported age, gender, language/dialect, disability status, physical adornments, physical attributes, and geo-location information, while trained annotators labeled apparent skin tone using the Fitzpatrick Skin Type and Monk Skin Tone scales, as well as voice timbre. Annotators also labeled the recording setup and provided per-second activity annotations.
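To illustrate how such self-provided and annotated attributes can support bias measurement, the sketch below computes per-subgroup accuracy gaps from a hypothetical metadata table; the column names and layout are invented for illustration and do not reflect the dataset's actual schema.

```python
# Illustrative subgroup fairness breakdown over dataset metadata; columns are hypothetical.
import pandas as pd

# Hypothetical per-video table: attributes plus a model's correctness on some downstream task.
meta = pd.DataFrame({
    "video_id": ["a", "b", "c", "d"],
    "age_bucket": ["18-30", "18-30", "31-45", "31-45"],
    "monk_skin_tone": [3, 8, 5, 9],
    "recording_country": ["Brazil", "India", "USA", "Vietnam"],
    "model_correct": [1, 0, 1, 1],
})

# Per-attribute accuracy gaps reveal whether the model under-performs on some subgroups.
for attr in ["age_bucket", "monk_skin_tone", "recording_country"]:
    by_group = meta.groupby(attr)["model_correct"].mean()
    print(attr, "max gap:", round(by_group.max() - by_group.min(), 3))
```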
Casual Conversations v2: Designing a large consent-driven dataset to measure algorithmic bias and robustness
Hazirbas, Caner, Bang, Yejin, Yu, Tiezheng, Assar, Parisa, Porgali, Bilal, Albiero, Vítor, Hermanek, Stefan, Pan, Jacqueline, McReynolds, Emily, Bogen, Miranda, Fung, Pascale, Ferrer, Cristian Canton
Several recent studies [8, 41, 55, 67, 75] propose various learning strategies for AI models to be well-calibrated across all protected subgroups, while others focus on collecting responsible datasets [57, 82, 124] to make sure evaluations of AI models are accurate and algorithmic bias can be measured while promoting data privacy. There has been much criticism regarding the design choices of publicly used datasets, such as ImageNet [36, 38, 56, 70]. Discussions are mostly focused on concerns around collecting sensitive data about people without their consent. Casual Conversations v1 [57] was one of the first benchmarks designed with permission from participants. However, that dataset has several limitations: samples were collected only in the US, the gender label is limited to three options, and only the age and gender labels are self-provided with the permission of the participants.
Localized Uncertainty Attacks
Dia, Ousmane Amadou, Karaletsos, Theofanis, Hazirbas, Caner, Ferrer, Cristian Canton, Kabul, Ilknur Kaynar, Meijer, Erik
The susceptibility of deep learning models to adversarial perturbations has renewed attention on adversarial examples, resulting in a number of attacks. However, most of these attacks fail to encompass a large spectrum of adversarial perturbations that are imperceptible to humans. In this paper, we present localized uncertainty attacks, a novel class of threat models against deterministic and stochastic classifiers. Under this threat model, we create adversarial examples by perturbing only regions in the inputs where a classifier is uncertain. To find such regions, we utilize the predictive uncertainty of the classifier when the classifier is stochastic, or we learn a surrogate model to amortize the uncertainty when it is deterministic. Unlike $\ell_p$ ball or functional attacks, which perturb inputs indiscriminately, our targeted changes can be less perceptible. When considered under our threat model, these attacks still produce strong adversarial examples, with the examples retaining a greater degree of similarity to the inputs.
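A simplified sketch of the idea, under stated assumptions, is given below: it localizes uncertainty with MC-dropout entropy gradients and then runs a masked PGD-style attack on only the most uncertain input locations. The paper's actual attack formulation and surrogate amortization differ, so this is an illustration of the localized-perturbation principle rather than the authors' method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in stochastic classifier: dropout kept active at inference gives MC-dropout samples.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128), nn.ReLU(),
                      nn.Dropout(0.5), nn.Linear(128, 10))
model.train()  # keep dropout stochastic for uncertainty estimation

def predictive_entropy(x, n_samples=8):
    probs = torch.stack([model(x).softmax(-1) for _ in range(n_samples)]).mean(0)
    return -(probs * probs.clamp_min(1e-8).log()).sum(-1)

x = torch.rand(1, 3, 32, 32)
y = torch.tensor([3])

# 1. Localize: score input locations by how strongly they drive predictive uncertainty,
#    then keep only the top fraction as the attack mask.
x_req = x.clone().requires_grad_(True)
predictive_entropy(x_req).sum().backward()
saliency = x_req.grad.abs()
threshold = saliency.flatten().quantile(0.9)          # perturb ~10% of the input
mask = (saliency >= threshold).float()

# 2. Attack: PGD-style steps restricted to the uncertain region only.
delta = torch.zeros_like(x, requires_grad=True)
eps, alpha = 0.1, 0.02
for _ in range(20):
    loss = F.cross_entropy(model(x + delta * mask), y)
    loss.backward()
    with torch.no_grad():
        delta += alpha * delta.grad.sign()
        delta.clamp_(-eps, eps)
    delta.grad.zero_()
x_adv = (x + delta.detach() * mask).clamp(0, 1)
```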