Mechanism and Emergence of Stacked Attention Heads in Multi-Layer Transformers
