Refusal in Language Models Is Mediated by a Single Direction

Andy Arditi (Independent), Oscar Obeso

Neural Information Processing Systems 

While this refusal behavior is widespread across chat models, its underlying mechanisms remain poorly understood. In this work, we show that refusal is mediated by a one-dimensional subspace, across 13 popular open-source chat models up to 72B parameters in size.
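To make the phrase "one-dimensional subspace" concrete, the following is a minimal sketch of what it means, in plain linear algebra, to measure and remove the component of a vector along a single direction. The excerpt above does not describe how the direction is found or how the models are modified, so nothing here should be read as the authors' procedure. The function names (`unit`, `component_along`, `project_out`) and the toy vectors are hypothetical and chosen only for illustration.

```python
import math


def unit(direction):
    """Return `direction` rescaled to unit length."""
    norm = math.sqrt(sum(x * x for x in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in direction]


def component_along(activation, direction):
    """Scalar projection of `activation` onto the line spanned by `direction`."""
    r = unit(direction)
    return sum(a * b for a, b in zip(activation, r))


def project_out(activation, direction):
    """Remove the part of `activation` that lies along `direction`.

    Computes x' = x - (x . r) r for the unit vector r, so the result is
    orthogonal to the one-dimensional subspace spanned by `direction`.
    """
    r = unit(direction)
    coeff = component_along(activation, direction)
    return [a - coeff * b for a, b in zip(activation, r)]


# Toy example: a 3-dimensional "activation" and a hypothetical direction.
activation = [3.0, 4.0, 0.0]
direction = [2.0, 0.0, 0.0]  # need not be unit length; it is normalized internally

ablated = project_out(activation, direction)
```

After the call, `ablated` has no remaining component along `direction`, while the components orthogonal to it are unchanged. A claim that a behavior is mediated by a single direction amounts to saying that this one scalar component carries the relevant information.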
