Red-teaming Activation Probes using Prompted LLMs
