On the Pitfalls of Analyzing Individual Neurons in Language Models

Omer Antverg, Yonatan Belinkov

arXiv.org Artificial Intelligence 

While many studies have shown that linguistic information is encoded in hidden word representations, few have studied individual neurons to show how and in which neurons it is encoded. Among these, the common approach is to use an external probe to rank neurons according to their relevance to some linguistic attribute, and to evaluate the obtained ranking using the same probe that produced it. We show two pitfalls in this methodology: 1. It confounds distinct factors: probe quality and ranking quality. We separate them and draw conclusions on each. 2. It focuses on encoded information rather than information that is used by the model. We show that these are not the same. We compare two recent ranking methods and a simple one we introduce, and evaluate them with regard to both of these aspects.

Many studies attempt to interpret language models by predicting different linguistic properties from word representations, an approach known as probing classifiers (Adi et al., 2017; Conneau et al., 2018, inter alia). A growing body of work focuses on individual neurons within the representation, attempting to show in which neurons some information is encoded, and whether it is localized (concentrated in a small set of neurons) or dispersed. Such knowledge may allow us to control the model's output (Bau et al., 2019), to reduce the number of parameters in the model (Voita et al., 2019; Sajjad et al., 2020), and to gain general scientific understanding of the model. The common methodology is to train a probe to predict some linguistic attribute from a representation, and to use it, in different ways, to rank the neurons of the representation according to their importance for the attribute in question.
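To make the probe-then-rank methodology concrete, the sketch below illustrates one common variant of it, not the specific ranking methods compared in the paper: a linear probe is trained to predict an attribute from hidden representations, neurons are ranked by the magnitude of their probe weights, and the ranking is then evaluated by re-probing on the top-k neurons. The data here is synthetic placeholder data, and the weight-magnitude heuristic is only one possible ranking criterion.

```python
# Minimal sketch (assumptions: linear probe, weight-magnitude ranking, synthetic data).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical data: `reps` are hidden word representations (n_words x n_neurons),
# `labels` are values of some linguistic attribute (e.g., part-of-speech tags).
n_words, n_neurons, n_classes = 2000, 768, 5
reps = rng.normal(size=(n_words, n_neurons))
labels = rng.integers(0, n_classes, size=n_words)

X_train, X_test, y_train, y_test = train_test_split(reps, labels, random_state=0)

# 1. Train a probe on the full representation.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# 2. Rank neurons by the magnitude of their probe weights
#    (one common heuristic; actual ranking methods differ in this step).
ranking = np.argsort(-np.abs(probe.coef_).sum(axis=0))

# 3. Evaluate the ranking: keep only the top-k neurons and train/evaluate a probe
#    on them alone. Note that this evaluates the ranking with a probe, which is
#    exactly the practice whose pitfalls the paper analyzes.
k = 50
top_k = ranking[:k]
probe_k = LogisticRegression(max_iter=1000).fit(X_train[:, top_k], y_train)
print(f"accuracy with top-{k} neurons: {probe_k.score(X_test[:, top_k], y_test):.3f}")
```

Because step 3 uses a probe both to produce and to judge the ranking, probe quality and ranking quality are entangled, and the evaluation reflects what is encoded in the selected neurons rather than what the model actually uses, the two pitfalls named above.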