Safety Pretraining: Toward the Next Generation of Safe AI

Maini, Pratyush, Goyal, Sachin, Sam, Dylan, Robey, Alex, Savani, Yash, Jiang, Yiding, Zou, Andy, Fredrikson, Matt, Lipton, Zachary C., Kolter, J. Zico

Sep-16-2025 – arXiv.org Artificial Intelligence

As large language models (LLMs) are increasingly deployed in high-stakes settings, the risk of generating harmful or toxic content remains a central challenge. Post-hoc alignment methods are brittle: once unsafe patterns are learned during pretraining, they are hard to remove. In this work, we present a data-centric pretraining framework that builds safety into the model from the start. Our framework consists of four key steps: (i) Safety Filtering: we build a safety classifier that sorts web data into safe and unsafe categories; (ii) Safety Rephrasing: we recontextualize unsafe web data into safer narratives; (iii) Native Refusal: we develop the RefuseWeb and Moral Education pretraining datasets, which actively teach the model to refuse unsafe content and convey the moral reasoning behind the refusal; and (iv) Harmfulness-Tag Annotated Pretraining: we flag unsafe content during pretraining with a special token and use it to steer the model away from unsafe generations at inference. Our safety-pretrained models reduce attack success rates from 38.8% to 8.4% on standard LLM safety benchmarks with no performance degradation on general tasks.
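Steps (i) and (iv) of the framework can be illustrated with a toy sketch. This is not the authors' implementation: the keyword scorer stands in for a learned safety classifier, and the tag string `<|harmful|>`, the threshold, and every function name below are hypothetical choices made for illustration. The inference-time steering shown here (suppressing the tag token's logit) is only one possible reading of "steer the model away"; the abstract does not specify the mechanism. Steps (ii) and (iii) would need a rewriting model and are not sketched.

```python
# Hypothetical names throughout: the abstract does not give the tag token,
# the classifier, or any thresholds.
HARM_TAG = "<|harmful|>"
UNSAFE_TERMS = {"exploit", "weapon", "poison"}  # toy stand-in for a learned safety classifier


def safety_score(text: str) -> float:
    """Toy score in [0, 1]: fraction of words found in UNSAFE_TERMS."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    if not words:
        return 0.0
    return sum(w in UNSAFE_TERMS for w in words) / len(words)


def route(text: str, threshold: float = 0.1) -> str:
    """Step (i): label a web document as 'safe' or 'unsafe'."""
    return "unsafe" if safety_score(text) >= threshold else "safe"


def annotate(text: str, threshold: float = 0.1) -> str:
    """Step (iv), training side: prefix unsafe documents with the special tag."""
    if route(text, threshold) == "unsafe":
        return f"{HARM_TAG} {text}"
    return text


def steer(logits: dict[str, float]) -> dict[str, float]:
    """Step (iv), inference side (assumed mechanism): forbid the tag token
    so that decoding cannot enter a tagged continuation."""
    steered = dict(logits)
    if HARM_TAG in steered:
        steered[HARM_TAG] = float("-inf")
    return steered


if __name__ == "__main__":
    corpus = ["How to bake bread at home", "Build a weapon quickly"]
    for doc in corpus:
        print(route(doc), "|", annotate(doc))

    next_token_logits = {HARM_TAG: 5.0, "the": 1.0}
    steered = steer(next_token_logits)
    print(max(steered, key=steered.get))
```

In a real pipeline the tag would be registered as a special token in the tokenizer, and the logit adjustment would be applied inside the decoding loop rather than on a plain dictionary.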

large language model, machine learning, natural language

Country:
- North America > United States (1.00)

Genre:
- Instructional Material (0.92)
- Research Report > New Finding (0.46)

Industry:
- Information Technology > Security & Privacy (1.00)
- Media (0.93)
- Education > Educational Setting (0.68)
- Law Enforcement & Public Safety
  - Crime Prevention & Enforcement (1.00)
  - Fraud (0.92)
  - Terrorism (0.67)
- Law
  - Criminal Law (1.00)
  - Civil Rights & Constitutional Law (1.00)
- Health & Medicine
  - Pharmaceuticals & Biotechnology (0.93)
  - Consumer Health (0.68)
  - Therapeutic Area > Psychiatry/Psychology
    - Mental Health (0.67)
- Government
  - Military (0.92)
  - Voting & Elections (0.67)
  - Regional Government > North America Government
    - United States Government (0.67)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (0.47)
