Reviews: Root Mean Square Layer Normalization

Neural Information Processing Systems 

ORIGINALITY:

- The proposed normalization technique is original in the sense that existing normalization techniques (batch, layer, group, instance, ...) differ mainly in the dimensions over which the activations are normalized, whereas this paper removes one of the typical steps in the normalization process (mean re-centering) in order to speed up training, which has been less well studied.
- This work proposes dividing by the RMS statistic instead of the standard deviation without hurting accuracy (the two statistics are sketched at the end of this review). However, other works (for example, Santurkar et al.) experiment with scaling by different statistics, such as various l_p norms, without a loss in training accuracy, so this work is not the first to suggest scaling the activations by a different statistic.

QUALITY:

- The authors tested their technique on multiple deep learning frameworks (TensorFlow, PyTorch, Theano), which lends more support to their empirical results, since different implementations can have very different timing behavior.
- The authors tested their technique on multiple tasks and neural network architectures.
- The main hypothesis is that the re-centering step in Layer Normalization is dispensable, but this is backed only by experimental results and would be considerably stronger with some theoretical justification.
- While the few experimental results show that there is no degradation of accuracy from not centering the activations, I am still not fully convinced that the centering step can be deemed unnecessary. For example, it is likely that the weights/biases of the networks in the paper are initialized such that the activations are already roughly centered around zero, in which case the mean-centering step can be removed without seeing much of a difference in performance (see the quick numerical check at the end of this review).
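
To make the distinction concrete, here is a brief sketch of the two statistics as I read them (the notation, with $g$ and $b$ as the learned gain and bias and $n$ as the layer width, is mine rather than copied from the paper):

LayerNorm:  $\bar{x}_i = \frac{x_i - \mu}{\sigma}\, g_i + b_i$, where $\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$ and $\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)^2}$

RMSNorm:  $\bar{x}_i = \frac{x_i}{\mathrm{RMS}(x)}\, g_i$, where $\mathrm{RMS}(x) = \sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^2}$

The only change is dropping $\mu$, i.e. the re-centering step, and replacing $\sigma$ with the uncentered second-moment statistic.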
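
To illustrate the last concern, a quick NumPy sketch (my own toy check, not from the paper; the learned gain and bias are omitted) comparing the two normalizations on roughly zero-mean versus shifted activations:

    import numpy as np

    def layer_norm(x, eps=1e-6):
        # LayerNorm: re-center by the mean, then divide by the standard deviation
        mu = x.mean(axis=-1, keepdims=True)
        sigma = x.std(axis=-1, keepdims=True)
        return (x - mu) / (sigma + eps)

    def rms_norm(x, eps=1e-6):
        # RMSNorm: divide by the root mean square only, with no re-centering
        rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True))
        return x / (rms + eps)

    rng = np.random.default_rng(0)
    zero_mean = rng.standard_normal((32, 512))   # activations already centered around zero
    shifted = zero_mean + 3.0                    # activations with a large positive mean

    for name, acts in [("zero-mean", zero_mean), ("shifted", shifted)]:
        gap = np.abs(layer_norm(acts) - rms_norm(acts)).max()
        print(f"{name}: max |LayerNorm - RMSNorm| = {gap:.4f}")

When the activations are already roughly zero-mean, the two outputs nearly coincide, so experiments in that regime cannot really distinguish "re-centering is unnecessary" from "re-centering happens to be a no-op for these networks".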