Safety Pretraining: Toward the Next Generation of Safe AI
Maini, Pratyush, Goyal, Sachin, Sam, Dylan, Robey, Alex, Savani, Yash, Jiang, Yiding, Zou, Andy, Fredrikson, Matt, Lipton, Zachary C., Kolter, J. Zico
arXiv.org Artificial Intelligence
As large language models (LLMs) are increasingly deployed in high-stakes settings, the risk of generating harmful or toxic content remains a central challenge. Post-hoc alignment methods are brittle: once unsafe patterns are learned during pretraining, they are hard to remove. In this work, we present a data-centric pretraining framework that builds safety into the model from the start. Our framework consists of four key steps: (i) Safety Filtering: we build a safety classifier to classify web data into safe and unsafe categories; (ii) Safety Rephrasing: we recontextualize unsafe web data into safer narratives; (iii) Native Refusal: we develop the RefuseWeb and Moral Education pretraining datasets, which actively teach the model to refuse unsafe content and convey the moral reasoning behind the refusal; and (iv) Harmfulness-Tag Annotated Pretraining: we flag unsafe content during pretraining using a special token and use it to steer the model away from unsafe generations at inference. Our safety-pretrained models reduce attack success rates from 38.8% to 8.4% on standard LLM safety benchmarks with no performance degradation on general tasks.
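As a rough illustration of steps (i) and (iv), the sketch below scores documents with a toy classifier and prefixes the ones flagged as unsafe with a special token. It is a minimal sketch under stated assumptions, not the authors' implementation: the keyword-based scorer stands in for a learned safety classifier, and the names `HARM_TAG`, `UNSAFE_KEYWORDS`, `unsafe_score`, `annotate`, and the threshold value are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical special token; the token actually used in the paper is not stated here.
HARM_TAG = "<|harmful|>"

# Toy stand-in for a learned safety classifier. The keyword list is an
# assumption made only so the example runs without a model.
UNSAFE_KEYWORDS = {"explosive", "malware", "poison"}


@dataclass
class Document:
    text: str


def unsafe_score(doc: Document) -> float:
    """Return the fraction of tokens that match the toy unsafe keyword list."""
    tokens = [t.strip(".,!?").lower() for t in doc.text.split()]
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in UNSAFE_KEYWORDS)
    return hits / len(tokens)


def annotate(doc: Document, threshold: float = 0.1) -> str:
    """Prefix documents scored at or above the threshold with the harmfulness tag."""
    if unsafe_score(doc) >= threshold:
        return f"{HARM_TAG} {doc.text}"
    return doc.text


corpus = [
    Document("How to bake sourdough bread at home."),
    Document("Step by step malware guide."),
]
annotated = [annotate(d) for d in corpus]
```

Here the first document passes through unchanged and the second is prefixed with the tag. The abstract does not say how the tag is used at inference time, so that part is left out of the sketch.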
Sep-16-2025
- Country:
- Asia > Thailand
- North America > United States
- Minnesota
- Saint Louis County > Duluth (0.04)
- Genre:
- Instructional Material (0.92)
- Research Report > New Finding (0.46)
- Industry:
- Education > Educational Setting (0.68)
- Government
- Military (0.92)
- Regional Government > North America Government
- United States Government (0.67)
- Voting & Elections (0.67)
- Health & Medicine
- Consumer Health (0.68)
- Pharmaceuticals & Biotechnology (0.93)
- Therapeutic Area > Psychiatry/Psychology
- Mental Health (0.67)
- Information Technology > Security & Privacy (1.00)
- Law
- Civil Rights & Constitutional Law (1.00)
- Criminal Law (1.00)
- Law Enforcement & Public Safety
- Crime Prevention & Enforcement (1.00)
- Fraud (0.92)
- Terrorism (0.67)
- Media (0.93)
- Technology: