Improving Neuron-level Interpretability with White-box Language Models
