Visualizing and Measuring the Geometry of BERT

Reif, Emily, Yuan, Ann, Wattenberg, Martin, Viegas, Fernanda B., Coenen, Andy, Pearce, Adam, Kim, Been

Neural Information Processing Systems 

Transformer architectures show significant promise for natural language processing. Given that a single pretrained model can be fine-tuned to perform well on many different tasks, these networks appear to extract generally useful linguistic features. A natural question is how such networks represent this information internally. This paper describes qualitative and quantitative investigations of one particularly effective model, BERT. At a high level, linguistic features seem to be represented in separate semantic and syntactic subspaces.