Revealing the Dark Secrets of BERT
Kovaleva, Olga, Romanov, Alexey, Rogers, Anna, Rumshisky, Anna
BERT-based architectures currently give state-of-the-art performance on many NLP tasks, but little is known about the exact mechanisms that contribute to their success. In the current work, we focus on the interpretation of self-attention, which is one of the fundamental underlying components of BERT. Using a subset of GLUE tasks and a set of handcrafted features-of-interest, we propose a methodology and carry out a qualitative and quantitative analysis of the information encoded by BERT's individual heads. Our findings suggest that there is a limited set of attention patterns that are repeated across different heads, indicating that the model is overparametrized. While different heads consistently use the same attention patterns, they have varying impact on performance across different tasks. We show that manually disabling attention in certain heads leads to a performance improvement over regular fine-tuned BERT models.
1 Introduction
Over the past year, models based on the Transformer architecture (Vaswani et al., 2017) have become the de-facto standard for state-of-the-art performance on many natural language processing (NLP) tasks (Radford et al., 2018; Devlin et al., 2018). Their key feature is the self-attention mechanism that provides an alternative to conventionally used recurrent neural networks (RNNs).
Aug-21-2019
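The kind of analysis described above can be reproduced in outline with standard tooling. The following is a minimal sketch, not the authors' code, of how one might extract per-head self-attention maps from BERT and disable (mask) individual heads; it assumes the HuggingFace `transformers` library, and the particular heads masked here are hypothetical choices made only for illustration.

```python
# Minimal sketch (assumption: HuggingFace `transformers`): inspect per-head
# self-attention maps and mask individual BERT heads. The masked heads are
# arbitrary, illustrative choices, not the heads identified in the paper.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

inputs = tokenizer("The cat sat on the mat.", return_tensors="pt")

# Build a head mask of shape (num_layers, num_heads):
# 1.0 keeps a head active, 0.0 disables it.
num_layers = model.config.num_hidden_layers   # 12 for bert-base
num_heads = model.config.num_attention_heads  # 12 for bert-base
head_mask = torch.ones(num_layers, num_heads)
head_mask[0, 3] = 0.0   # hypothetical: disable head 3 in layer 0
head_mask[5, 7] = 0.0   # hypothetical: disable head 7 in layer 5

with torch.no_grad():
    outputs = model(**inputs, head_mask=head_mask)

# `outputs.attentions` is a tuple with one tensor per layer, each of shape
# (batch, num_heads, seq_len, seq_len): the per-head attention maps whose
# recurring patterns this kind of analysis examines.
for layer_idx, layer_attn in enumerate(outputs.attentions):
    print(f"layer {layer_idx}: attention shape {tuple(layer_attn.shape)}")
```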