Safety Pretraining: Toward the Next Generation of Safe AI
Maini, Pratyush, Goyal, Sachin, Sam, Dylan, Robey, Alex, Savani, Yash, Jiang, Yiding, Zou, Andy, Fredrikson, Matt, Lipton, Zachary C., Kolter, J. Zico
arXiv.org Artificial Intelligence
As large language models (LLMs) are increasingly deployed in high-stakes settings, the risk of generating harmful or toxic content remains a central challenge. Post-hoc alignment methods are brittle: once unsafe patterns are learned during pretraining, they are hard to remove. In this work, we present a data-centric pretraining framework that builds safety into the model from the start. Our framework consists of four key steps: (i) Safety Filtering: we build a safety classifier that sorts web data into safe and unsafe categories; (ii) Safety Rephrasing: we recontextualize unsafe web data into safer narratives; (iii) Native Refusal: we develop the RefuseWeb and Moral Education pretraining datasets, which actively teach the model to refuse unsafe content and convey the moral reasoning behind the refusal; and (iv) Harmfulness-Tag Annotated Pretraining: we flag unsafe content during pretraining with a special token and use it to steer the model away from unsafe generations at inference. Our safety-pretrained models reduce attack success rates from 38.8% to 8.4% on standard LLM safety benchmarks with no performance degradation on general tasks.
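Steps (i) and (iv) of the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the tag string `<|harmful|>`, the keyword-based `toy_safety_score` (a stand-in for the learned safety classifier), the `threshold` value, and the idea of steering by masking the tag token's logit at inference. The abstract does not specify any of these details, so this should not be read as the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical tag string; the abstract only says "a special token".
HARM_TAG = "<|harmful|>"


@dataclass
class Doc:
    text: str


def toy_safety_score(text: str) -> float:
    """Stand-in for a learned safety classifier (assumption).

    Returns the fraction of words that appear in a tiny toy blocklist.
    """
    blocklist = {"exploit", "weapon", "poison"}
    words = text.lower().split()
    if not words:
        return 0.0
    hits = sum(w.strip(".,!?") in blocklist for w in words)
    return hits / len(words)


def prepare_corpus(
    docs: List[Doc],
    score: Callable[[str], float] = toy_safety_score,
    threshold: float = 0.1,  # assumed cutoff, not taken from the paper
) -> List[str]:
    """Prefix documents scored as unsafe with the harmfulness tag."""
    out = []
    for d in docs:
        if score(d.text) >= threshold:
            out.append(f"{HARM_TAG} {d.text}")
        else:
            out.append(d.text)
    return out


def suppress_tag(logits: List[float], tag_id: int) -> List[float]:
    """One assumed way to steer at inference: forbid sampling the tag token."""
    masked = list(logits)
    masked[tag_id] = float("-inf")
    return masked
```

For example, `prepare_corpus([Doc("How to bake bread at home."), Doc("Build a weapon with poison.")])` leaves the first document unchanged and prefixes the second with the tag. A real pipeline would replace `toy_safety_score` with a trained classifier and would apply the mask inside the decoding loop of the model.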
Sep-16-2025
- Country:
  - North America > United States (1.00)
- Genre:
  - Instructional Material (0.92)
  - Research Report > New Finding (0.46)
- Industry:
  - Information Technology > Security & Privacy (1.00)
  - Media (0.93)
  - Education > Educational Setting (0.68)
  - Law Enforcement & Public Safety
    - Crime Prevention & Enforcement (1.00)
    - Fraud (0.92)
    - Terrorism (0.67)
  - Law
    - Criminal Law (1.00)
    - Civil Rights & Constitutional Law (1.00)
  - Health & Medicine
    - Pharmaceuticals & Biotechnology (0.93)
    - Consumer Health (0.68)
    - Therapeutic Area > Psychiatry/Psychology
      - Mental Health (0.67)
  - Government
    - Military (0.92)
    - Voting & Elections (0.67)
    - Regional Government > North America Government
      - United States Government (0.67)
- Technology: