Effective Theory of Transformers at Initialization

Emily Dinan, Sho Yaida, Susan Zhang

arXiv.org Artificial Intelligence 

This introduction paves the way for our effective-theory analysis of the backward path in §3, where we'll figure out how to scale a relative learning-rate factor for each group of model parameters in Transformers.

A. Vanilla SGD

The SGD update equation is given by
$$\theta_\mu(t) = \theta_\mu(t-1) - \eta_t \left.\frac{\partial \mathcal{L}_{\mathcal{A}_t}}{\partial \theta_\mu}\right|_{\theta = \theta(t-1)}, \qquad (1.87)$$
where the model-parameter index $\mu$ runs over all the $P$ model parameters $\theta_\mu$ in the architecture, $\eta_t$ is the learning rate at iteration $t$, $\mathcal{L}_{\mathcal{A}_t}$ denotes the loss function evaluated on the minibatch $\mathcal{A}_t$ at iteration $t$, and $\theta_\mu(0)$ are drawn from the initialization distribution that was extensively discussed in §1. In this standard form, we assign the single learning rate $\eta_t$ to all the model parameters, but we'll soon find that, in theory, the learning rate for each group $G$ of model parameters must be scaled differently as we embiggen Transformers.
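For concreteness, here is a minimal Python sketch of the update (1.87), extended with an optional per-group relative learning-rate factor of the kind the section motivates. The function name, the group names, and the factor dictionary are illustrative assumptions, not the paper's prescription for how the factors should actually scale.

```python
import numpy as np

def sgd_step(params, grads, eta_t, group_factors=None):
    """Apply theta_mu(t) = theta_mu(t-1) - eta_t * dL/dtheta_mu, group by group.

    params:        dict mapping group name G -> np.ndarray of parameters theta(t-1)
    grads:         dict mapping group name G -> np.ndarray of gradients of L_{A_t},
                   evaluated at theta = theta(t-1)
    eta_t:         global learning rate at iteration t
    group_factors: optional dict G -> relative factor for that group (assumed
                   notation); omitted or 1.0 everywhere recovers vanilla SGD
    """
    group_factors = group_factors or {}
    new_params = {}
    for g, theta in params.items():
        lam = group_factors.get(g, 1.0)  # vanilla SGD: same rate for every group
        new_params[g] = theta - eta_t * lam * grads[g]
    return new_params

# Usage example with two hypothetical parameter groups of a Transformer block.
rng = np.random.default_rng(0)
params = {"attention": rng.normal(size=8), "mlp": rng.normal(size=8)}
grads = {"attention": rng.normal(size=8), "mlp": rng.normal(size=8)}
params = sgd_step(params, grads, eta_t=0.1,
                  group_factors={"attention": 1.0, "mlp": 0.5})
```

With `group_factors` left out, every parameter is updated with the single learning rate $\eta_t$, matching the standard form of (1.87); the per-group factors are the knob that the effective-theory analysis of the backward path is meant to fix.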
