Where Did It Go Wrong? Attributing Undesirable LLM Behaviors via Representation Gradient Tracing
