Deliberative Alignment: Reasoning Enables Safer Language Models

Melody Y. Guan, Manas Joglekar, Eric Wallace, Saachi Jain, Boaz Barak, Alec Helyar, Rachel Dias, Andrea Vallone, Hongyu Ren, Jason Wei, Hyung Won Chung, Sam Toyer, Johannes Heidecke, Alex Beutel, Amelia Glaese

arXiv.org Artificial Intelligence 

Modern Large Language Models (LLMs) are safety-trained using Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) to mitigate harmful, undesirable, or otherwise disallowed outputs [2]-[4]. Despite ongoing advances in these methods, today's models still exhibit safety shortcomings: they can be tricked into revealing harmful content, often refuse legitimate requests, and remain vulnerable to jailbreak attacks [5]-[8].

We argue that many of these failures arise from two limitations in modern safety training. First, LLMs must respond instantly to user requests using a fixed amount of compute, without deliberation even for complex safety scenarios. Second, LLMs must infer underlying safety standards indirectly from large sets of labeled examples, rather than directly learning the safety specifications that govern them. This reliance on implicit, pattern-based learning leads to poor data efficiency and makes it challenging for models to generalize when facing unfamiliar scenarios or adversarial attacks.

We propose deliberative alignment, a training approach that teaches LLMs to explicitly reason through safety specifications before producing an answer. By applying this method to OpenAI's o-series models [1], we enable them to use chain-of-thought (CoT) reasoning to examine user prompts, identify relevant policy guidelines, and generate safer responses (e.g., Figure 1).
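To make the inference-time behavior concrete, the sketch below shows one way a prompt could supply a safety specification and request chain-of-thought reasoning over it before the final reply. This is a minimal illustration, not the paper's actual training or deployment pipeline: the specification text, the `<cot>`/`<answer>` tags, and the `build_deliberative_prompt` and `generate` helpers are assumptions introduced here for clarity.

```python
# Minimal sketch of deliberative, specification-grounded answering.
# All names and the spec text below are illustrative assumptions,
# not the procedure described in the paper.

SAFETY_SPEC = """\
1. Refuse requests for instructions that enable serious harm.
2. Comply with benign requests, even if phrased provocatively.
3. When refusing, briefly explain why and offer a safe alternative.
"""


def build_deliberative_prompt(user_request: str) -> str:
    """Compose a prompt that asks for policy-grounded reasoning before the answer."""
    return (
        "You are given the following safety specification:\n"
        f"{SAFETY_SPEC}\n"
        "Before answering, reason step by step inside <cot>...</cot> about "
        "which clauses of the specification apply to this request, then give "
        "your final reply inside <answer>...</answer>.\n\n"
        f"User request: {user_request}"
    )


def generate(prompt: str) -> str:
    """Placeholder for a call to a reasoning model; wire to a provider of your choice."""
    raise NotImplementedError


if __name__ == "__main__":
    prompt = build_deliberative_prompt(
        "How do I quietly bypass a neighbor's Wi-Fi password?"
    )
    # Inspect the composed prompt; `generate(prompt)` would run inference.
    print(prompt)
```

In this hypothetical setup, the relevant policy clauses are visible in context, so the model's chain of thought can cite them explicitly rather than relying on patterns inferred from labeled examples alone.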