Effective Theory of Transformers at Initialization

Emily Dinan, Sho Yaida, Susan Zhang

arXiv.org Artificial Intelligence 

This introduction paves the way for our effective-theory analysis of the backward path in §3, where we'll figure out how to scale a relative learning-rate factor for each group of model parameters in Transformers.

A. Vanilla SGD

The SGD update equation is given by
$$\theta_\mu(t) = \theta_\mu(t-1) - \eta_t \left.\frac{\partial \mathcal{L}_{\mathcal{A}_t}}{\partial \theta_\mu}\right|_{\theta = \theta(t-1)}, \qquad (1.87)$$
where the model-parameter index $\mu$ runs over all the $P$ model parameters $\theta_\mu$ in the architecture, $\eta_t$ is the learning rate at iteration $t$, $\mathcal{L}_{\mathcal{A}_t}$ denotes the loss function evaluated on the minibatch $\mathcal{A}_t$ at iteration $t$, and $\theta_\mu(0)$ are drawn from the initialization distribution that was extensively discussed in §1. In this standard form, we assign the single learning rate $\eta_t$ to all the model parameters, but we'll soon find that, in theory, the learning rate for each group $G$ of model parameters must be scaled differently as we embiggen Transformers.
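For concreteness, here is a minimal Python sketch of the update (1.87), extended with an optional per-group relative learning-rate factor of the kind the section motivates. The function name, the group names, and the factor dictionary are illustrative assumptions, not the paper's prescription for how the factors should actually scale.

```python
import numpy as np

def sgd_step(params, grads, eta_t, group_factors=None):
    """Apply theta_mu(t) = theta_mu(t-1) - eta_t * dL/dtheta_mu, group by group.

    params:        dict mapping group name G -> np.ndarray of parameters theta(t-1)
    grads:         dict mapping group name G -> np.ndarray of gradients of L_{A_t},
                   evaluated at theta = theta(t-1)
    eta_t:         global learning rate at iteration t
    group_factors: optional dict G -> relative factor for that group (assumed
                   notation); omitted or 1.0 everywhere recovers vanilla SGD
    """
    group_factors = group_factors or {}
    new_params = {}
    for g, theta in params.items():
        lam = group_factors.get(g, 1.0)  # vanilla SGD: same rate for every group
        new_params[g] = theta - eta_t * lam * grads[g]
    return new_params

# Usage example with two hypothetical parameter groups of a Transformer block.
rng = np.random.default_rng(0)
params = {"attention": rng.normal(size=8), "mlp": rng.normal(size=8)}
grads = {"attention": rng.normal(size=8), "mlp": rng.normal(size=8)}
params = sgd_step(params, grads, eta_t=0.1,
                  group_factors={"attention": 1.0, "mlp": 0.5})
```

With `group_factors` left out, every parameter is updated with the single learning rate $\eta_t$, matching the standard form of (1.87); the per-group factors are the knob that the effective-theory analysis of the backward path is meant to fix.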
