Revealing the Dark Secrets of BERT
Kovaleva, Olga, Romanov, Alexey, Rogers, Anna, Rumshisky, Anna
BERT-based architectures currently give state-of-the-art performance on many NLP tasks, but little is known about the exact mechanisms that contribute to their success. In the current work, we focus on the interpretation of self-attention, which is one of the fundamental underlying components of BERT. Using a subset of GLUE tasks and a set of handcrafted features-of-interest, we propose a methodology and carry out a qualitative and quantitative analysis of the information encoded by BERT's individual heads. Our findings suggest that there is a limited set of attention patterns that are repeated across different heads, indicating that the model is overparametrized. While different heads consistently use the same attention patterns, they have varying impact on performance across different tasks. We show that manually disabling attention in certain heads leads to a performance improvement over regular fine-tuned BERT models.
1 Introduction
Over the past year, models based on the Transformer architecture (Vaswani et al., 2017) have become the de-facto standard for state-of-the-art performance on many natural language processing (NLP) tasks (Radford et al., 2018; Devlin et al., 2018). Their key feature is the self-attention mechanism that provides an alternative to conventionally used recurrent neural networks (RNNs).
Aug-21-2019
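The kind of analysis described above can be reproduced in outline with standard tooling. The following is a minimal sketch, not the authors' code, of how one might extract per-head self-attention maps from BERT and disable (mask) individual heads; it assumes the HuggingFace `transformers` library, and the particular heads masked here are hypothetical choices made only for illustration.

```python
# Minimal sketch (assumption: HuggingFace `transformers`): inspect per-head
# self-attention maps and mask individual BERT heads. The masked heads are
# arbitrary, illustrative choices, not the heads identified in the paper.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

inputs = tokenizer("The cat sat on the mat.", return_tensors="pt")

# Build a head mask of shape (num_layers, num_heads):
# 1.0 keeps a head active, 0.0 disables it.
num_layers = model.config.num_hidden_layers   # 12 for bert-base
num_heads = model.config.num_attention_heads  # 12 for bert-base
head_mask = torch.ones(num_layers, num_heads)
head_mask[0, 3] = 0.0   # hypothetical: disable head 3 in layer 0
head_mask[5, 7] = 0.0   # hypothetical: disable head 7 in layer 5

with torch.no_grad():
    outputs = model(**inputs, head_mask=head_mask)

# `outputs.attentions` is a tuple with one tensor per layer, each of shape
# (batch, num_heads, seq_len, seq_len): the per-head attention maps whose
# recurring patterns this kind of analysis examines.
for layer_idx, layer_attn in enumerate(outputs.attentions):
    print(f"layer {layer_idx}: attention shape {tuple(layer_attn.shape)}")
```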