On the Convergence of Encoder-only Shallow Transformers

Neural Information Processing Systems 

In addition, an analysis based on the neural tangent kernel (NTK) is given, which facilitates a comprehensive comparison. Our theory demonstrates a separation in the importance of different scaling schemes and initializations.
