Towards Faithful Natural Language Explanations: A Study Using Activation Patching in Large Language Models
