Tutorial #17: Transformers III Training

Nov-22-2022, 05:10:41 GMT–#artificialintelligence

In part I of this tutorial we introduced the self-attention mechanism and the transformer architecture. In part II, we discussed position encoding and how to extend the transformer to longer sequence lengths. We also discussed connections between the transformer and other machine learning models. In this final part, we discuss challenges with transformer training dynamics and introduce some of the tricks that practitioners use to get transformers to converge. This discussion is aimed at researchers who already understand the transformer architecture and who are interested in training transformers and similar models from scratch. Despite their broad applications, transformers are surprisingly difficult to train from scratch. The input to the transformer is an $I\times D$ matrix containing the $D$-dimensional embeddings for each of the $I$ input tokens.
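As a small illustration of the input described above, the sketch below builds an $I\times D$ matrix by looking up one embedding row per token. The vocabulary size, the embedding dimension, the token ids, and the helper name `embed_tokens` are all hypothetical and chosen only for this example; plain Python lists are used to keep it dependency-free, whereas a real model would typically use a tensor library and learned embeddings.

```python
import random


def embed_tokens(token_ids, embedding_table):
    """Return one D-dimensional embedding row per token (an I x D matrix)."""
    return [list(embedding_table[t]) for t in token_ids]


# Hypothetical sizes, chosen only for illustration.
VOCAB_SIZE = 10  # number of distinct tokens
D = 4            # embedding dimension

rng = random.Random(0)
# Randomly initialised table with one row per vocabulary entry (VOCAB_SIZE x D).
embedding_table = [
    [rng.gauss(0.0, 1.0) for _ in range(D)] for _ in range(VOCAB_SIZE)
]

token_ids = [3, 1, 7]  # I = 3 input tokens (assumed ids)
X = embed_tokens(token_ids, embedding_table)

# Rows index tokens, columns index embedding dimensions.
print(len(X), len(X[0]))
```

Each row of `X` corresponds to one input token, so a longer input sequence adds rows while the number of columns stays fixed at $D$.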



