Refusal in Language Models Is Mediated by a Single Direction
Andy Arditi (Independent), Oscar Obeso
Neural Information Processing Systems
While refusal behavior is widespread across chat models, its underlying mechanisms remain poorly understood. In this work, we show that refusal is mediated by a one-dimensional subspace across 13 popular open-source chat models of up to 72B parameters.
Oct-10-2025, 21:42:48 GMT
- Country:
  - Asia > Middle East
    - Jordan (0.04)
  - Europe
    - Latvia > Lubāna Municipality
      - Lubāna (0.04)
    - Switzerland > Zürich
      - Zürich (0.04)
  - North America
    - Mexico > Puebla (0.04)
    - United States
      - California > Los Angeles County
        - Santa Monica (0.04)
      - Maryland (0.04)
- Genre:
  - Research Report
    - Experimental Study (1.00)
    - New Finding (0.93)
- Industry:
  - Government (1.00)
  - Health & Medicine > Therapeutic Area (0.67)
  - Information Technology > Security & Privacy (1.00)
  - Law > Criminal Law (0.67)
  - Law Enforcement & Public Safety > Crime Prevention & Enforcement (0.93)
- Technology: