The Impact of Depth and Width on Transformer Language Model Generalization