Linearly Decoding Refused Knowledge in Aligned Language Models

Open in new window