An Embarrassingly Simple Defense Against LLM Abliteration Attacks
Harethah Abu Shairah, Hasan Abed Al Kader Hammoud, Bernard Ghanem, George Turkiyyah
arXiv.org Artificial Intelligence
Large language models (LLMs) are typically aligned to refuse harmful instructions through safety fine-tuning. A recent attack, termed abliteration, identifies and suppresses the single latent direction most responsible for refusal behavior, thereby enabling models to generate harmful content. We propose a defense that fundamentally alters how models express refusal. We construct an extended-refusal dataset in which responses to harmful prompts provide detailed justifications before refusing, distributing the refusal signal across multiple token positions. Fine-tuning Llama-2-7B-Chat and Qwen2.5-Instruct (1.5B and 3B parameters) on this dataset yields models that maintain high refusal rates under abliteration: refusal rates drop by at most 10%, compared to 70-80% drops in baseline models. Comprehensive evaluations of safety and utility demonstrate that extended-refusal fine-tuning effectively neutralizes abliteration attacks while preserving general model performance and enhancing robustness across multiple alignment scenarios.
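To make the attack concrete, below is a minimal sketch of the direction-ablation step that abliteration relies on, written in PyTorch. It assumes access to a model's residual-stream activations and uses the standard difference-of-means estimator for the refusal direction; every identifier and the toy dimensions are illustrative assumptions, not code from the paper.

import torch

def refusal_direction(harmful_acts: torch.Tensor,
                      harmless_acts: torch.Tensor) -> torch.Tensor:
    """Estimate the single 'refusal' direction by difference of means.

    Each tensor holds residual-stream activations of shape
    (n_prompts, d_model), collected at one layer and token position
    for harmful vs. harmless instructions.
    """
    direction = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return direction / direction.norm()  # unit vector r_hat

def ablate(acts: torch.Tensor, r_hat: torch.Tensor) -> torch.Tensor:
    """Remove the refusal component: x <- x - (x . r_hat) r_hat.

    Applied to activations (or folded into the weight matrices that
    write to the residual stream), this projection suppresses the
    model's ability to express refusal.
    """
    return acts - (acts @ r_hat).unsqueeze(-1) * r_hat

# Toy usage with random activations (assumed d_model = 4096).
if __name__ == "__main__":
    harmful = torch.randn(32, 4096)
    harmless = torch.randn(32, 4096)
    r_hat = refusal_direction(harmful, harmless)
    edited = ablate(harmful, r_hat)
    print(edited @ r_hat)  # ~zero: ablated activations are orthogonal to r_hat

Read this way, the proposed defense is intuitive: extended refusals spread the refusal signal across many token positions, so it no longer concentrates in the single rank-1 subspace that this projection removes.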
Oct-8-2025