SOM Directions are Better than One: Multi-Directional Refusal Suppression in Language Models

Piras, Giorgio, Mura, Raffaele, Brau, Fabio, Oneto, Luca, Roli, Fabio, Biggio, Battista

Nov-14-2025–arXiv.org Artificial Intelligence

Refusal refers to the functional behavior enabling safety-aligned language models to reject harmful or unethical prompts. Following the growing scientific interest in mechanistic interpretability, recent work has encoded refusal behavior as a single direction in the model's latent space, computed, for example, as the difference between the centroids of harmful and harmless prompt representations. However, emerging evidence suggests that concepts in LLMs often appear to be encoded as low-dimensional manifolds embedded in the high-dimensional latent space. Motivated by these findings, we propose a novel method leveraging Self-Organizing Maps (SOMs) to extract multiple refusal directions. To this end, we first prove that SOMs generalize the difference-in-means technique used in prior work. We then train SOMs on harmful prompt representations to identify multiple neurons. By subtracting the centroid of harmless representations from each neuron, we derive a set of directions expressing the refusal concept. We validate our method in an extensive experimental setup, demonstrating that ablating multiple directions from the models' internals outperforms not only the single-direction baseline but also specialized jailbreak algorithms, leading to an effective suppression of refusal. Finally, we conclude by analyzing the mechanistic implications of our approach.
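The pipeline described in the abstract (train a SOM on harmful representations, subtract the harmless centroid from each neuron, ablate the resulting directions) can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the authors' implementation: the function names (`train_som`, `refusal_directions`, `ablate`), the 2x2 grid, the linear decay schedules, and the choice to orthonormalize the directions before projecting them out are all assumptions made here for the example.

```python
import numpy as np

def train_som(data, grid=(2, 2), epochs=20, lr0=0.5, sigma0=1.0, seed=0):
    """Minimal online SOM (illustrative). Returns weights of shape (n_neurons, dim)."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    # Fixed 2D grid coordinates of the neurons, used by the neighbourhood function.
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    # Initialise each neuron on a distinct, randomly chosen data point.
    weights = data[rng.choice(len(data), size=rows * cols, replace=False)].astype(float)
    n_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for idx in rng.permutation(len(data)):
            x = data[idx]
            frac = step / n_steps
            lr = lr0 * (1.0 - frac)                  # assumed linear learning-rate decay
            sigma = sigma0 * (1.0 - frac) + 1e-3     # assumed linear neighbourhood decay
            # Best-matching unit: the neuron closest to the sample.
            bmu = np.argmin(np.linalg.norm(weights - x, axis=1))
            grid_dist2 = np.sum((coords - coords[bmu]) ** 2, axis=1)
            h = np.exp(-grid_dist2 / (2.0 * sigma ** 2))
            # Pull every neuron towards x, weighted by its grid distance to the BMU.
            weights += lr * h[:, None] * (x - weights)
            step += 1
    return weights

def refusal_directions(harmful, harmless, grid=(2, 2)):
    """One unit-norm direction per SOM neuron: neuron minus harmless centroid."""
    neurons = train_som(harmful, grid=grid)
    dirs = neurons - harmless.mean(axis=0)
    return dirs / np.linalg.norm(dirs, axis=1, keepdims=True)

def ablate(acts, directions):
    """Project activations onto the orthogonal complement of span(directions).

    Orthonormalising via QR is an assumption of this sketch; the source does
    not state how the multiple directions are combined during ablation.
    """
    q, _ = np.linalg.qr(directions.T)   # q: (dim, k) with orthonormal columns
    return acts - (acts @ q) @ q.T

# Synthetic stand-ins for hidden representations (hypothetical data).
rng = np.random.default_rng(0)
dim = 16
harmless = rng.normal(size=(200, dim))
harmful = rng.normal(size=(200, dim)) + 3.0 * np.eye(dim)[0]

directions = refusal_directions(harmful, harmless)
cleaned = ablate(harmful, directions)
# Largest remaining component along any extracted direction after ablation.
residual = np.abs(cleaned @ directions.T).max()
```

With a single neuron, the same construction reduces to one "neuron minus harmless centroid" vector, which is the shape of the difference-in-means baseline that the abstract says SOMs generalize.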

large language model, machine learning, natural language

Country:
- Europe (0.68)
- North America > United States (0.46)
- Asia > Russia (0.28)

Genre:
- Research Report
  - Promising Solution (0.48)
  - New Finding (0.46)

Industry:
- Information Technology > Security & Privacy (1.00)
- Health & Medicine (1.00)
- Law (0.93)
- Government (0.93)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (0.71)
  - Machine Learning > Statistical Learning (0.68)
  - Representation & Reasoning > Search (0.46)
