Mechanism and Emergence of Stacked Attention Heads in Multi-Layer Transformers
arXiv.org Artificial Intelligence
In this paper, I introduce the retrieval problem, a simple reasoning task that can be solved only by transformers with at least a certain minimum number of layers. The task's difficulty is adjustable, so the required number of layers can be raised to any desired value. I demonstrate that large language models can solve the task under different prompting formulations without any fine-tuning. To understand how transformers solve the retrieval problem, I train several transformers on a minimal formulation. I find that successful learning occurs only in the presence of an implicit curriculum. I uncover the learned mechanisms by studying the attention maps of the trained transformers. I also study the training process and find that attention heads always emerge in a specific sequence.
Nov-18-2024
- Genre:
- Research Report (0.64)
- Technology: