The Impact of Depth and Width on Transformer Language Model Generalization
