What does self-attention learn from Masked Language Modelling?
