DNT: a Deeply Normalized Transformer that can be trained by Momentum SGD