Safety Pretraining: Toward the Next Generation of Safe AI
Neural Information Processing Systems
As large language models (LLMs) are increasingly deployed in high-stakes settings, the risk of generating harmful or toxic content remains a central challenge. Post-hoc alignment methods are brittle: once unsafe patterns are learned during pretraining, they are hard to remove. In this work, we present a data-centric pretraining framework that builds safety into the model from the start. Our framework consists of four key steps: (i) Safety Filtering: we build a safety classifier that sorts web data into safe and unsafe categories; (ii) Safety Rephrasing: we recontextualize unsafe web data into safer narratives; (iii) Native Refusal: we synthetically generate pretraining datasets that actively teach models to refuse unsafe content and convey the moral reasoning behind the refusal; and (iv) Harmfulness-Tag Annotated Pretraining: we flag unsafe content during pretraining with a special token and use that token to steer models away from unsafe generations at inference time. Our safety-pretrained models reduce attack success rates from 38.8% to 8.4% on standard LLM safety benchmarks, with no performance degradation on general tasks.
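As a rough illustration of how steps (i), (ii), and (iv) might fit together at the data-preparation stage, the following minimal sketch routes each document through a safety score and either keeps it, rephrases it, or prefixes it with a tag. Everything named here is an assumption made for illustration and is not taken from the paper: the `<|harmful|>` token string, the keyword-based `toy_safety_score` (a stand-in for a trained classifier), the 0.5 threshold, and the either-rephrase-or-tag routing. Step (iii) and inference-time steering are not covered.

```python
from typing import Callable, List, Optional

# Hypothetical special token; the actual token used by the authors is not given here.
HARM_TAG = "<|harmful|>"


def toy_safety_score(text: str) -> float:
    """Stand-in for a trained safety classifier.

    Returns 1.0 if any flagged phrase appears in the text, else 0.0.
    A real system would use a learned model, not keyword matching.
    """
    flagged_phrases = ("build a bomb", "steal credit card")
    lowered = text.lower()
    return 1.0 if any(phrase in lowered for phrase in flagged_phrases) else 0.0


def prepare_corpus(
    texts: List[str],
    score_fn: Callable[[str], float] = toy_safety_score,
    threshold: float = 0.5,  # assumed cut-off, chosen only for this sketch
    rephrase_fn: Optional[Callable[[str], str]] = None,
) -> List[str]:
    """Route each document by its unsafe score.

    - Below the threshold: keep unchanged (safety filtering).
    - At or above the threshold with a rephraser: replace with a safer narrative.
    - At or above the threshold without a rephraser: prefix with HARM_TAG.
    """
    prepared = []
    for text in texts:
        if score_fn(text) < threshold:
            prepared.append(text)
        elif rephrase_fn is not None:
            prepared.append(rephrase_fn(text))
        else:
            prepared.append(f"{HARM_TAG} {text}")
    return prepared


if __name__ == "__main__":
    corpus = ["How to bake bread", "How to build a bomb at home"]
    for line in prepare_corpus(corpus):
        print(line)
```

In this sketch the rephraser is an injected callable, so an LLM-based rewriter could be swapped in without changing the routing logic; whether unsafe documents should be rephrased, tagged, or both is a design choice the sketch deliberately simplifies.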
Jun-16-2026, 17:06:55 GMT
- Country:
  - North America > United States (0.28)
- Genre:
  - Instructional Material (0.67)
  - Research Report
    - Experimental Study (1.00)
    - New Finding (0.92)
- Industry:
  - Information Technology > Security & Privacy (1.00)
  - Media (0.93)
  - Banking & Finance (0.92)
  - Education > Educational Setting (0.68)
  - Law Enforcement & Public Safety
    - Crime Prevention & Enforcement (1.00)
    - Fraud (0.92)
    - Terrorism (0.67)
  - Law
    - Criminal Law (1.00)
    - Civil Rights & Constitutional Law (1.00)
  - Health & Medicine > Therapeutic Area
    - Psychiatry/Psychology > Mental Health (0.67)
  - Government
    - Voting & Elections (0.67)
    - Military (0.67)
- Technology: