The heads hypothesis: A unifying statistical approach towards understanding multi-headed attention in BERT

Open in new window