DNT: a Deeply Normalized Transformer that can be trained by Momentum SGD
