Tutorial #17: Transformers III Training

In part I of this tutorial, we introduced the self-attention mechanism and the transformer architecture. In part II, we discussed position encoding and how to extend the transformer to longer sequence lengths, as well as connections between the transformer and other machine learning models. In this final part, we discuss challenges with transformer training dynamics and introduce some of the tricks that practitioners use to get transformers to converge. This discussion is aimed at researchers who already understand the transformer architecture and who are interested in training transformers and similar models from scratch.

Despite their broad applications, transformers are surprisingly difficult to train from scratch. The input consists of an $I\times D$ matrix containing the $D$-dimensional embeddings for each of the $I$ input tokens.
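As a rough illustration of the input described above, the following sketch builds an $I\times D$ matrix by looking up one $D$-dimensional embedding per token. The vocabulary size `V`, the random `embedding_table`, and the example `token_ids` are assumptions made purely for illustration; in practice the embeddings would be learned.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: V vocabulary entries, each with a D-dimensional embedding.
V, D = 10, 4
embedding_table = rng.normal(size=(V, D))  # stand-in for a learned embedding table

# Hypothetical input sequence of I = 3 token ids.
token_ids = np.array([3, 1, 7])
I = len(token_ids)

# Row lookup: row i of X is the embedding of the i-th input token.
X = embedding_table[token_ids]

print(X.shape)  # (I, D)
```

Each row of `X` corresponds to one input token, so the sequence length sets the number of rows and the embedding size sets the number of columns.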
