Towards eliciting latent knowledge from LLMs with mechanistic interpretability