Towards eliciting latent knowledge from LLMs with mechanistic interpretability