A Granular Study of Safety Pretraining under Model Abliteration
Agnihotri, Shashank, Jakubassa, Jonas, Dey, Priyam, Goyal, Sachin, Schiele, Bernt, Radhakrishnan, Venkatesh Babu, Keuper, Margret
arXiv.org Artificial Intelligence
Open-weight LLMs can be modified at inference time with simple activation edits, which raises a practical question for safety: do common safety interventions, such as refusal training or metatag training, survive such edits? We study model abliteration, a lightweight projection technique designed to remove refusal-sensitive directions, and conduct a controlled evaluation across a granular sequence of Safety Pretraining checkpoints for SmolLM2-1.7B, alongside widely used open baselines. For each of 20 systems (original and abliterated), we issue 100 prompts balanced between harmful and harmless cases, classify responses as **Refusal** or **Non-Refusal** using multiple judges, and validate judge fidelity on a small human-labeled subset. We also probe whether models can identify refusal in their own outputs. Our study produces a checkpoint-level characterization of which data-centric safety components remain robust under abliteration, quantifies how judge selection influences evaluation outcomes, and outlines a practical protocol for integrating inference-time edits into safety assessments. Code: https://github.com/shashankskagnihotri/safety_pretraining.
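The abstract does not spell out the abliteration procedure, but projection edits of this kind are commonly implemented as a rank-1 orthogonal projection that removes a "refusal direction" from the weights writing into the residual stream. The sketch below is illustrative only: the function names, tensor shapes, and the difference-of-means estimate of the refusal direction are assumptions about the standard recipe, not the authors' code.

```python
import torch


def estimate_refusal_direction(harmful_acts: torch.Tensor,
                               harmless_acts: torch.Tensor) -> torch.Tensor:
    """Difference-of-means estimate of the refusal direction (assumed recipe).

    harmful_acts / harmless_acts: (n_prompts, d_model) residual-stream
    activations at a chosen layer for harmful vs. harmless prompts.
    """
    direction = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return direction / direction.norm()


def abliterate_weight(weight: torch.Tensor,
                      refusal_dir: torch.Tensor) -> torch.Tensor:
    """Remove the refusal direction from a weight matrix's output space.

    weight: (d_model, d_in) matrix that writes into the residual stream
    (e.g., an attention output projection or MLP down-projection).
    Applies W <- (I - r r^T) W, a rank-1 orthogonal projection, so the
    edited layer can no longer write along the refusal direction r.
    """
    r = refusal_dir / refusal_dir.norm()  # ensure unit norm
    return weight - torch.outer(r, r @ weight)
```

Applying such a projection to every matrix that writes into the residual stream yields an "abliterated" model without any fine-tuning, which is what makes this class of edit cheap enough to evaluate across many checkpoints.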
Oct-6-2025