The Geometry of Harmfulness in LLMs through Subconcept Probing
