On the Optimization and Generalization of Multi-head Attention
Puneesh Deora, Rouzbeh Ghaderi, Hossein Taheri, Christos Thrampoulidis
The training and generalization dynamics of the Transformer's core mechanism, namely the attention mechanism, remain under-explored. Moreover, existing analyses primarily focus on single-head attention. Inspired by the demonstrated benefits of overparameterization when training fully-connected networks, we investigate the potential optimization and generalization advantages of using multiple attention heads. Towards this goal, we derive convergence and generalization guarantees for gradient-descent training of a single-layer multi-head self-attention model, under a suitable realizability condition on the data. We then establish primitive conditions on the initialization that ensure realizability holds. Finally, we demonstrate that these conditions are satisfied for a simple tokenized-mixture model. We expect that the analysis can be extended to various data-model and architecture variations.
Oct-19-2023
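As an illustration of the setting described in the abstract, below is a minimal sketch of a single-layer multi-head self-attention classifier trained with plain gradient descent on synthetic tokenized-mixture data. The pooling, linear classification head, loss, step size, and data generator are illustrative assumptions, not the exact construction analyzed in the paper.

```python
import torch
import torch.nn as nn

class MultiHeadAttentionClassifier(nn.Module):
    """Single-layer multi-head self-attention, mean-pooled into a linear head."""
    def __init__(self, dim, num_heads, num_classes=2):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                    # x: (batch, tokens, dim)
        out, _ = self.attn(x, x, x)          # self-attention: queries = keys = values = x
        return self.head(out.mean(dim=1))    # pool over tokens, then classify

def sample_tokenized_mixture(n, tokens=8, dim=16):
    """Hypothetical tokenized-mixture data: one class-dependent 'signal' token
    per sequence; the remaining tokens are noise."""
    y = torch.randint(0, 2, (n,))
    means = torch.stack([torch.ones(dim), -torch.ones(dim)])  # class-conditional means
    x = 0.5 * torch.randn(n, tokens, dim)                     # noise tokens
    x[:, 0, :] += means[y]                                    # plant the signal token
    return x, y

torch.manual_seed(0)
x_train, y_train = sample_tokenized_mixture(n=512)
model = MultiHeadAttentionClassifier(dim=16, num_heads=4)     # multiple attention heads
opt = torch.optim.SGD(model.parameters(), lr=0.1)             # plain full-batch gradient descent
loss_fn = nn.CrossEntropyLoss()

for step in range(200):
    loss = loss_fn(model(x_train), y_train)
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 50 == 0:
        acc = (model(x_train).argmax(dim=1) == y_train).float().mean()
        print(f"step {step:3d}  loss {loss.item():.4f}  train acc {acc.item():.3f}")
```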