Goto

Collaborating Authors

 Beutel, Alex


Deliberative Alignment: Reasoning Enables Safer Language Models

arXiv.org Artificial Intelligence

Modern Large Language Models (LLMs) are safety trained using Supervised Fine Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) to mitigate harmful, undesirable, or otherwise disallowed outputs [2]-[4]. Despite ongoing advances in these methods, today's models still exhibit safety shortcomings: they can be tricked into revealing harmful content, often refuse legitimate requests, and remain vulnerable to jailbreak attacks [5]-[8]. We argue that many of these failures arise from two limitations in modern safety training. First, LLMs must respond instantly to user requests using a fixed amount of compute, without deliberation even for complex safety scenarios. Second, LLMs must infer underlying safety standards indirectly from large sets of labeled examples, rather than directly learning the safety specifications that govern them. This reliance on implicit, pattern-based learning leads to poor data efficiency and makes it challenging for models to generalize when facing unfamiliar scenarios or adversarial attacks. We propose deliberative alignment, a training approach that teaches LLMs to explicitly reason through safety specifications before producing an answer. By applying this method to OpenAI's o-series models [1], we enable them to use chain-of-thought (CoT) reasoning to examine user prompts, identify relevant policy guidelines, and generate safer responses (e.g., Figure 1).


Diverse and Effective Red Teaming with Auto-generated Rewards and Multi-step Reinforcement Learning

arXiv.org Artificial Intelligence

Automated red teaming can discover rare model failures and generate challenging examples that can be used for training or evaluation. However, a core challenge in automated red teaming is ensuring that the attacks are both diverse and effective. Prior methods typically succeed in optimizing either for diversity or for effectiveness, but rarely both. In this paper, we provide methods that enable automated red teaming to generate a large number of diverse and successful attacks. Our approach decomposes the task into two steps: (1) automated methods for generating diverse attack goals and (2) generating effective attacks for those goals. While we provide multiple straightforward methods for generating diverse goals, our key contributions are to train an RL attacker that both follows those goals and generates diverse attacks for those goals. First, we demonstrate that it is easy to use a large language model (LLM) to generate diverse attacker goals with per-goal prompts and rewards, including rule-based rewards (RBRs) to grade whether the attacks are successful for the particular goal. Second, we demonstrate how training the attacker model with multi-step RL, where the model is rewarded for generating attacks that are different from past attempts further increases diversity while remaining effective. We use our approach to generate both prompt injection attacks and prompts that elicit unsafe responses. In both cases, we find that our approach is able to generate highly-effective and considerably more diverse attacks than past general red-teaming approaches.


OpenAI o1 System Card

arXiv.org Artificial Intelligence

The o1 model series is trained with large-scale reinforcement learning to reason using chain of thought. These advanced reasoning capabilities provide new avenues for improving the safety and robustness of our models. In particular, our models can reason about our safety policies in context when responding to potentially unsafe prompts, through deliberative alignment. This leads to state-of-the-art performance on certain benchmarks for risks such as generating illicit advice, choosing stereotyped responses, and succumbing to known jailbreaks. Training models to incorporate a chain of thought before answering has the potential to unlock substantial benefits, while also increasing potential risks that stem from heightened intelligence. Our results underscore the need for building robust alignment methods, extensively stress-testing their efficacy, and maintaining meticulous risk management protocols. This report outlines the safety work carried out for the OpenAI o1 and OpenAI o1-mini models, including safety evaluations, external red teaming, and Preparedness Framework evaluations.


Rule Based Rewards for Language Model Safety

arXiv.org Artificial Intelligence

Reinforcement learning based fine-tuning of large language models (LLMs) on human preferences has been shown to enhance both their capabilities and safety behavior. However, in cases related to safety, without precise instructions to human annotators, the data collected may cause the model to become overly cautious, or to respond in an undesirable style, such as being judgmental. Additionally, as model capabilities and usage patterns evolve, there may be a costly need to add or relabel data to modify safety behavior. We propose a novel preference modeling approach that utilizes AI feedback and only requires a small amount of human data. Our method, Rule Based Rewards (RBR), uses a collection of rules for desired or undesired behaviors (e.g. refusals should not be judgmental) along with a LLM grader. In contrast to prior methods using AI feedback, our method uses fine-grained, composable, LLM-graded few-shot prompts as reward directly in RL training, resulting in greater control, accuracy and ease of updating. We show that RBRs are an effective training method, achieving an F1 score of 97.1, compared to a human-feedback baseline of 91.7, resulting in much higher safety-behavior accuracy through better balancing usefulness and safety.


GPT-4o System Card

arXiv.org Artificial Intelligence

GPT-4o is an autoregressive omni model that accepts as input any combination of text, audio, image, and video, and generates any combination of text, audio, and image outputs. It's trained end-to-end across text, vision, and audio, meaning all inputs and outputs are processed by the same neural network. GPT-4o can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time in conversation. It matches GPT-4 Turbo performance on text in English and code, with significant improvement on text in non-English languages, while also being much faster and 50\% cheaper in the API. GPT-4o is especially better at vision and audio understanding compared to existing models. In line with our commitment to building AI safely and consistent with our voluntary commitments to the White House, we are sharing the GPT-4o System Card, which includes our Preparedness Framework evaluations. In this System Card, we provide a detailed look at GPT-4o's capabilities, limitations, and safety evaluations across multiple categories, focusing on speech-to-speech while also evaluating text and image capabilities, and measures we've implemented to ensure the model is safe and aligned. We also include third-party assessments on dangerous capabilities, as well as discussion of potential societal impacts of GPT-4o's text and vision capabilities.


First-Person Fairness in Chatbots

arXiv.org Artificial Intelligence

Chatbots like ChatGPT are used for diverse purposes, ranging from resume writing to entertainment. These real-world applications are different from the institutional uses, such as resume screening or credit scoring, which have been the focus of much of AI research on fairness. Ensuring equitable treatment for all users in these first-person contexts is critical. In this work, we study "first-person fairness," which means fairness toward the chatbot user. This includes providing high-quality responses to all users regardless of their identity or background and avoiding harmful stereotypes. We propose a scalable, privacy-preserving method for evaluating one aspect of first-person fairness across a large, heterogeneous corpus of real-world chatbot interactions. Specifically, we assess potential bias linked to users' names, which can serve as proxies for demographic attributes like gender or race, in chatbot systems such as ChatGPT, which provide mechanisms for storing and using user names. Our method leverages a second language model to privately analyze name-sensitivity in the chatbot's responses. We verify the validity of these annotations through independent human evaluation. Further, we show that post-training interventions, including RL, significantly mitigate harmful stereotypes. Our approach also yields succinct descriptions of response differences across tasks. For instance, in the "writing a story" task, chatbot responses show a tendency to create protagonists whose gender matches the likely gender inferred from the user's name. Moreover, a pattern emerges where users with female-associated names receive responses with friendlier and simpler language slightly more often than users with male-associated names. Finally, we provide the system messages required for external researchers to further investigate ChatGPT's behavior with hypothetical user profiles.


The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

arXiv.org Artificial Intelligence

Today's LLMs are susceptible to prompt injections, jailbreaks, and other attacks that allow adversaries to overwrite a model's original instructions with their own malicious prompts. In this work, we argue that one of the primary vulnerabilities underlying these attacks is that LLMs often consider system prompts (e.g., text from an application developer) to be the same priority as text from untrusted users and third parties. To address this, we propose an instruction hierarchy that explicitly defines how models should behave when instructions of different priorities conflict. We then propose an automated data generation method to demonstrate this hierarchical instruction following behavior, which teaches LLMs to selectively ignore lower-privileged instructions. We apply this method to LLMs, showing that it drastically increases robustness--even for attack types not seen during training--while imposing minimal degradations on standard capabilities.


Multi-Group Fairness Evaluation via Conditional Value-at-Risk Testing

arXiv.org Machine Learning

Machine learning (ML) models used in prediction and classification tasks may display performance disparities across population groups determined by sensitive attributes (e.g., race, sex, age). We consider the problem of evaluating the performance of a fixed ML model across population groups defined by multiple sensitive attributes (e.g., race and sex and age). To address this issue, we propose an approach to test for performance disparities based on Conditional Value-at-Risk (CVaR). By allowing a small probabilistic slack on the groups over which a model has approximately equal performance, we show that the sample complexity required for discovering performance violations is reduced exponentially to be at most upper bounded by the square root of the number of groups. As a byproduct of our analysis, when the groups are weighted by a specific prior distribution, we show that Rรฉnyi entropy of order 2/3 of the prior distribution captures the sample complexity of the proposed CVaR test algorithm. Finally, we also show that there exists a non-i.i.d. Machine learning (ML) algorithms are increasingly used in domains of consequence such as hiring [1], lending [2], [3], policing [4], and healthcare [5], [6].


Effective Robustness against Natural Distribution Shifts for Models with Different Training Data

arXiv.org Artificial Intelligence

"Effective robustness" measures the extra out-of-distribution (OOD) robustness beyond what can be predicted from the in-distribution (ID) performance. Existing effective robustness evaluations typically use a single test set such as ImageNet to evaluate the ID accuracy. This becomes problematic when evaluating models trained on different data distributions, e.g., comparing models trained on ImageNet vs. zero-shot language-image pre-trained models trained on LAION. In this paper, we propose a new evaluation metric to evaluate and compare the effective robustness of models trained on different data. To do this, we control for the accuracy on multiple ID test sets that cover the training distributions for all the evaluated models. Our new evaluation metric provides a better estimate of effective robustness when there are models with different training data. It may also explain the surprising effective robustness gains of zero-shot CLIP-like models exhibited in prior works that used ImageNet as the only ID test set, while the gains diminish under our new evaluation.


Improving Diversity of Demographic Representation in Large Language Models via Collective-Critiques and Self-Voting

arXiv.org Artificial Intelligence

A crucial challenge for generative large language models (LLMs) is diversity: when a user's prompt is under-specified, models may follow implicit assumptions while generating a response, which may result in homogenization of the responses, as well as certain demographic groups being under-represented or even erased from the generated responses. In this paper, we formalize diversity of representation in generative LLMs. We present evaluation datasets and propose metrics to measure diversity in generated responses along people and culture axes. We find that LLMs understand the notion of diversity, and that they can reason and critique their own responses for that goal. This finding motivated a new prompting technique called collective-critique and self-voting (CCSV) to self-improve people diversity of LLMs by tapping into its diversity reasoning capabilities, without relying on handcrafted examples or prompt tuning. Extensive empirical experiments with both human and automated evaluations show that our proposed approach is effective at improving people and culture diversity, and outperforms all baseline methods by a large margin.